### <ins>Mini-Project: Amsterdam House Price Analysis with Decision Tree Variations and SGD Linear Regression</ins><br>
<i>- This notebook is meant as a personal exercise -</i>

Here we work with the data set "Amsterdam House Price Prediction" <i>from Kaggle (https://www.kaggle.com/datasets/thomasnibb/amsterdam-house-price-prediction).</i><br>
The overarching goal is to create a model that <b>predicts the market price of houses</b> based on the information we get from the aforementioned dataset. 
<br>
<br>The dataset contains:

<ul>
<li>address</li>
<li><b>zip-code</b></li>
<li><b>price</b></li>
<li><b>space</b></li>
<li><b>room count</b></li>
<li>lat/lon coordinates</li>
</ul>

<br>
of <b>924 houses</b> within the Amsterdam region. <b>Price</b> will be <b>our target</b> to predict and we will focus on the <b>zip-code, space and room count</b> as <b>our predictors</b>.<br>
<br><i>Note: To plot Amsterdam areas by zip code we will use a second source for the coordinates (https://public.opendatasoft.com/explore/dataset/georef-netherlands-postcode-pc4/export/).</i><br>
<br>
Since Price is a continuous numerical variable we will use regression models for our predictions. <i>If these won't provide reasonable results we will think about transforming the target price into a categorical variable, which might improve predictions.</i>
<br>
<br>
<i>.... OK! Let's get started</i>
<br>
<br>
First we load all packages we will need:

In [None]:
# Loading all Packages
# General Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import plotly.express as px
import scipy as spy
import scipy.stats  # make sure the stats submodule is loaded for the spy.stats calls below
import seaborn as sns
import random
from unicodedata import category

# ML Packages
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, VotingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder
import pickle



### PART I: Exploring the Data
We load the .csv (downloaded from Kaggle) into a dataframe and check the data formats and missing values.

In [None]:
# Loading dataframe
data = pd.read_csv("app/src/HousingPrices-Amsterdam-August-2021.csv")

The first look shows us that we have 3 features that are potentially good predictors: Rooms, Area (i.e. house size in m²) and the Zip Code (specifying the location of the house).

In [None]:
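# First look at data format and check for missing values
display(data.head(5))
data.info()  # prints dtypes and non-null counts itself and returns None, so display() is not needed
# Check whether any numeric column contains a literal 0
if (data.select_dtypes("number") == 0).any().any():
    print("Columns contain Zeroes")
else:
    print("Columns contain NO Zeroes")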

We see that, except for the first column, all columns have sensible, descriptive names. "Unnamed: 0" appears to be just an index column, which we will check by confirming that its values increase continuously by 1.<br>
<br>
We also see that only Price has missing values. Since only 4 rows are affected, we will exclude those rows entirely.
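
As a small illustration of the missing-value check described here, the following sketch counts missing values per column and drops only the rows without a target. The frame `toy` and its values are made up for the example and are not the Kaggle data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle data (values are made up for illustration)
toy = pd.DataFrame({
    "Price": [685000.0, np.nan, 850000.0, 580000.0],
    "Area": [64, 60, 109, 128],
    "Room": [3, 3, 4, 6],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows where the target is missing
toy_clean = toy.dropna(subset=["Price"])
print(len(toy), "->", len(toy_clean), "rows")
```

Restricting `dropna` to the target column keeps rows that might only lack a value in a column we do not use.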

In [None]:
# Calculate mean of Unnamed and compare to mean of simple continuous series with increase of 1 - indication 1 for Unnamed = Index
print("Mean of 'Unnamed: 0': ",data["Unnamed: 0"].mean())
print("Mean of a continuous number series of 'Unnamed: 0' length: ", sum(range(data["Unnamed: 0"].iat[0],data["Unnamed: 0"].iat[-1]+1))/len(data["Unnamed: 0"]))

# Calculate slope of Unnamed showing slope is equal 1 - indication 2 for Unnamed = Index
test_slope = spy.stats.linregress(data["Unnamed: 0"],data.index)[0]
print("The slope is: ", test_slope)

# Plotting Unnamed to visualize continuity and slope of Unnamed 
plt.figure(figsize=(3,3))
sns.lineplot(data["Unnamed: 0"])
plt.title("Column is Continuously increasing by 1")
plt.xlabel("Pandas Index")
plt.xlim(0,925)
plt.ylabel("Value")
plt.ylim(0,925)
plt.show()

# Drop the index-like "Unnamed: 0" column and the columns we do not use, then drop rows without a price
data = data.drop(["Unnamed: 0","Address","Lon","Lat"], axis=1)
data = data.dropna(subset=["Price"])



Since the mean of "Unnamed: 0" is equal to the mean of a simple continuous number series counting from 1 to 925 and the line plot shows a straight line with a slope of 1, we conclude that this column is indeed just an index column. The column is therefore dropped, together with the address and the lat/lon coordinates, which we do not use as predictors.
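
A more direct way to test for an index-like column is to look at the step between consecutive rows. This is only a sketch on a hypothetical stand-in series named `candidate`; in the notebook the same idea would apply to the "Unnamed: 0" column before it is dropped.

```python
import pandas as pd

# Hypothetical stand-in for the "Unnamed: 0" column
candidate = pd.Series([1, 2, 3, 4, 5], name="Unnamed: 0")

# An index-like column has a constant step of 1 between consecutive rows
steps = candidate.diff().dropna()
is_running_index = bool((steps == 1).all())
print(is_running_index)
```

Unlike a comparison of means, this check fails as soon as a single value is skipped or out of order.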

Checking whether the size of the house (Area) and the room count (Room) correlate with the house price shows that both can serve as relatively good predictors (Room: r=0.62; Area: r=0.83).<br>
<br>
First we look at a quick correlation matrix and additionally we look at the data with a scatter plot

In [None]:
#first look at correlations
mask = np.array([[1,1,1],[0,1,1],[0,0,1]])
sns.heatmap(data[["Price","Area","Room"]].corr(method="pearson"), vmin=0, vmax=1, mask=mask)
plt.show()

In [None]:
fig2, axs = plt.subplots(ncols=2)
sns.regplot(data=data,x="Room",y="Price", ax=axs[0])
sns.regplot(data=data,x="Area",y="Price", ax=axs[1])
print("r value for Room vs Price: ", spy.stats.linregress(data["Room"], data["Price"])[2])
print("r value for Area vs Price: ", spy.stats.linregress(data["Area"], data["Price"])[2])
plt.show()
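
To make the r values above more tangible, here is a minimal sketch of how Pearson's r is computed, using made-up numbers rather than the notebook's data. The manual formula is compared against `np.corrcoef`.

```python
import numpy as np

# Made-up example values (area in m², price in euros), for illustration only
area = np.array([50.0, 64.0, 80.0, 109.0, 128.0])
price = np.array([400000.0, 685000.0, 600000.0, 850000.0, 900000.0])

# Pearson r = covariance of x and y divided by the product of their standard deviations
dx = area - area.mean()
dy = price - price.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns the full correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(area, price)[0, 1]
print(r_manual, r_numpy)
```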

Let's have a look at the zip code data. It is a reasonable guess that the zip code is a good predictor of the price of a house; this is usually the case in bigger cities like Amsterdam. However, when we look up how zip codes are constructed in the Netherlands, we see that the full code specifies an area down to the street. This is far too specific for the data available here. Using the full zip codes directly would dilute the data set, leaving only one or two houses for many zip codes. Therefore we first reduce each zip code to its first 4 digits, which specify an area within the city (70 areas).

In [None]:
data["Region"] = data["Zip"].str[0:4].astype(int)
data["Price"] = data["Price"].astype(float)
data["Area"] = data["Area"].astype(float)
data["Room"] = data["Room"].astype(int)
data = data.drop(["Zip"], axis=1)
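
The effect of truncating the zip code can be seen on a handful of made-up codes. The sketch below assumes the usual Dutch "1234 AB" layout and uses a regular expression instead of fixed slicing, which is a swapped-in variant of the cell above, not the notebook's own method.

```python
import pandas as pd

# Made-up zip codes in the Dutch "1234 AB" format
zips = pd.Series(["1091 CR", "1059 EL", "1097 SM", "1091 AB"])

# The first four digits (PC4) identify the area; the two letters narrow it down further
pc4 = zips.str.extract(r"^(\d{4})", expand=False).astype(int)

# Houses per area before and after truncation
print(zips.nunique(), "full zip codes ->", pc4.nunique(), "PC4 areas")
print(pc4.value_counts())
```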

As expected, a simple bar plot of the price per region clearly shows that some regions sell for more.<br>
<br>
<i>Note: The bars show the median price per region; the error bars show the standard error of the mean (SEM).</i>

In [None]:
fig3, ax = plt.subplots(figsize=(20,5))
sns.barplot(data=data,x="Region",y="Price", estimator="median", errorbar="se", color="lightblue", ax=ax)
# seaborn sorts numeric categories, so we only style the existing tick labels instead of overwriting them
ax.tick_params(axis="x", labelsize=8, labelrotation=90)
ax.set_title("Median Price of each Region (by Zip)")
ax.set_ylabel("Price")
ax.set_xlabel("Region (Zip Code)")
ax.set(ylim=(0,(2*10**6)))
plt.show()

To check whether there is truly a significant difference between the regions (groups) we perform a quick one-way ANOVA test.<br>
We will see that the p-value is < 0.05, thus we can reject H0 and conclude that there is a significant difference between the groups.<br>
<br>
<i>Note: For that we first have to transform the data into a list of price lists (one per region, without NaNs) so we can use the SciPy function.</i>

In [None]:
Regio_Price = data[["Region","Price"]].copy()
Regio_Price["Region"] = Regio_Price["Region"].astype(str)

In [None]:
region_list = Regio_Price.Region.unique()
# one list of prices per region, in the form scipy.stats.f_oneway expects
All_Regio_Price = [list(Regio_Price["Price"][Regio_Price["Region"]==i]) for i in region_list]


In [None]:
print("p-value of ANOVA test: ", spy.stats.f_oneway(*All_Regio_Price)[1])
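
The same test can be sketched end to end on synthetic data, building the groups with `groupby`. The three regions and their price distributions below are invented for the example and say nothing about the real Amsterdam regions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic prices for three made-up regions (not the notebook's data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Region": np.repeat(["1011", "1055", "1102"], 30),
    "Price": np.concatenate([
        rng.normal(900_000, 100_000, 30),
        rng.normal(500_000, 100_000, 30),
        rng.normal(350_000, 100_000, 30),
    ]),
})

# One array of prices per region; f_oneway takes each group as a separate argument
groups = [g["Price"].to_numpy() for _, g in toy.groupby("Region")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
```

One-way ANOVA assumes roughly normal groups with similar variances, so with skewed prices and very small groups the p-value should be read as a rough indication only.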

To get a better idea of the regions within Amsterdam we use plotly and GeoJSON data to project every region onto a map.

In [None]:

# load the PC4 polygons; the with-block closes the file after reading
with open("app/src/georef-netherlands-postcode-pc4_simple.geojson","r") as f:
    geodata = json.load(f)

# plotly specifically needs an "id" key on each feature, so we copy it over from the properties
for i in geodata["features"]:
    i["id"]=i["properties"]["pc4_code"]

In [None]:
#get zip codes from geojson data
zip_list = []
for i in geodata["features"]:
    zip_list.append(i["properties"]["pc4_code"])

#check if every zip code in data set is present in geodata
for i in data["Region"].unique():
    if str(i) not in zip_list:
        print(i, "is not represented in geojson")
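
The two steps above (copying the code into "id" and checking coverage) can be sketched with a set difference. The structure `toy_geodata` is a hand-written stand-in with the geometry left out; it assumes, as the check above suggests, that `pc4_code` is stored as a string.

```python
# Minimal hand-written GeoJSON-like structure (geometry omitted for brevity)
toy_geodata = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"pc4_code": "1011"}},
        {"type": "Feature", "properties": {"pc4_code": "1012"}},
    ],
}
toy_regions = [1011, 1012, 1099]

# Copy the code into "id" and collect all known codes in a set for fast lookups
for feature in toy_geodata["features"]:
    feature["id"] = feature["properties"]["pc4_code"]
known_codes = {feature["id"] for feature in toy_geodata["features"]}

# Regions in the data that have no polygon in the GeoJSON
missing = sorted(set(map(str, toy_regions)) - known_codes)
print(missing)
```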


In [None]:
avg_data = data.groupby(["Region"]).median().reset_index()[["Region","Price"]]

In [None]:
fig2 = px.choropleth_mapbox(
    avg_data,
    locations=avg_data["Region"],
    geojson=geodata,
    color=avg_data["Price"],
    zoom=9.7, 
    center = {"lat": 52.356157, "lon": 4.907736},
    color_continuous_scale="matter", 
    mapbox_style="carto-positron",
    hover_data=avg_data,
    range_color=[avg_data["Price"].min(),avg_data["Price"].max()],
    width=1000,
    height=500
)

fig2.update_layout(                      
    margin={"r":0,"t":0,"l":0,"b":0}
)

fig2.show()

In [None]:
agg_data = (data
            .groupby("Region")
            .count()
            .sort_values(by="Price", ascending=False)
            .reset_index())

fig3 = plt.subplots(figsize=(10,5))
fig3 = sns.barplot(x=agg_data.iloc[:,0], y=agg_data.iloc[:,1], order=agg_data["Region"], color="lightblue")
fig3.set_xticklabels(labels=agg_data.iloc[:,0],fontdict={"fontsize":"8","rotation":"vertical"})
plt.show()

When plotting the count of houses per region we see that many regions contain fewer than 10 houses. Since we have already established that the price depends on the region the house is in, we will work around this strong imbalance by creating "new regions" that each correspond to a certain price range.<br>
This also reduces the dimensionality compared to the straightforward approach of simply One-Hot encoding the zip codes.<br>
<br>
We will arbitrarily create five price-regions with what should be realistic boundaries: up to 300000, 300000 to 600000, 600000 to 900000, 900000 to 1200000, and more than 1200000.
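
One compact way to assign a price to such a range is `pd.cut`. This is a sketch of the binning idea with made-up prices, not the approach used later in the notebook; the open-ended upper edge (`inf`) is an assumption made for the example.

```python
import pandas as pd

# Bin edges and labels following the ranges above; the last bin is open-ended
edges = [0, 300_000, 600_000, 900_000, 1_200_000, float("inf")]
labels = ["0_to_300k", "300k_to_600k", "600k_to_900k", "900k_to_1200k", "more_than_1200000"]

# Made-up prices for illustration
prices = pd.Series([250_000, 300_000, 450_000, 950_000, 2_000_000])

# right=True (the default) makes each bin include its upper edge: (low, high]
price_bin = pd.cut(prices, bins=edges, labels=labels)
print(price_bin.tolist())
```

Because each bin includes its upper edge, a house priced at exactly 300000 lands in the lowest bin.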

### PART II: Data Engineering

To prepare the data for ML training we have to encode the region column. After preparing a One-Hot encoded data set as a benchmark, we will prepare data sets that reduce the dimensionality of the data. In total we will have three approaches:
<ul>
<li><b>One-Hot encoding:</b> Straightforward one-hot encoding of the zip-regions as the benchmark.</li>
<li><b>Bin Sequence encoding:</b> each zip-region is represented by all new price-regions. Each price-region column holds a 1 if any house of that zip-region falls within its boundaries. This leads to a 5-digit encoding (e.g. {"0_to_300k": <b>1</b>,"300k_to_600k": <b>1</b>,"600k_to_900k": <b>1</b>,"900k_to_1200k": <b>0</b>,"more_than_1200000": <b>0</b>})</li>
<li><b>Label encoding:</b> the price-regions are encoded into a single column (possible because the new price-regions have a meaningful order). Each zip-region is assigned the price-region in which it appears most frequently in the data.</li>
</ul>

#### One-Hot Encoding

In [None]:
oh_data = data.copy().reset_index()

In [None]:
enc = OneHotEncoder(drop="first")
cats = pd.DataFrame(enc.fit_transform(oh_data[["Region"]]).toarray())
oh_data = pd.concat([oh_data,cats], axis=1).drop(["index"], axis=1)

In [None]:
# fitting the One-Hot encoder once so that possible future data points can be encoded consistently for predictions
encode_pattern_var1 = data.copy().reset_index()

enc = OneHotEncoder(drop="first")
encode_pattern_var1 = enc.fit(encode_pattern_var1[["Region"]])


In [None]:
# creating One-Hot Encoding function enabling to properly encode possible future data points for predictions
def EncodeRegionVar1(transformData: pd.DataFrame, OH_Encoder: OneHotEncoder, encoderColumn: str) -> pd.DataFrame:
    # drop a leftover "index" column (if any) and reset to a clean 0..n-1 index so concat aligns row by row
    transformData = transformData.drop(columns=["index"], errors="ignore").reset_index(drop=True)

    temp_frame = pd.DataFrame((OH_Encoder.transform(transformData[[encoderColumn]])).toarray())
    transformData = pd.concat([transformData,temp_frame], axis=1).drop([encoderColumn], axis=1)
    transformData.columns = transformData.columns.astype(str)

    return transformData

#### Bin-Sequence Encoding

In [None]:
# creating bin regions based on prices
new_regions_bins = [[0,300000],[300000,600000],[600000,900000],[900000,1200000],[1200000,1000000000]]
new_regions = ["0_to_300k","300k_to_600k","600k_to_900k","900k_to_1200k","more_than_1200000"]

# creating dataframe that holds the zip-region of each house in the column of its price bin (0 elsewhere)
df_data = [np.where((data["Price"] > low) & (data["Price"] <= high), data["Region"].astype(int), 0) for low, high in new_regions_bins]
temp_frame = pd.DataFrame(data=dict(zip(new_regions,df_data)), dtype="int32")

# add the new dataframe to the data
data = pd.concat([data.reset_index(),temp_frame], axis=1).drop(["index"], axis=1)

Plotting the count of houses in each price-region shows a big improvement, as the lowest count is now 70. However, we also see that the data is still very imbalanced.<br>
<br>
<b>Since the data set is so small we do not have a good way to address the data imbalance. Over-sampling is extremely likely to lead to overfitting. Under-sampling is not an option either with such a small data set. For the purpose of this exercise we will just continue with the data as is.</b><br>
<br>



In [None]:
# plotting new price-regions vs house count within the region
fig4, ax = plt.subplots(figsize=(10,5))
fig4 = sns.barplot(data = (pd.DataFrame(data[new_regions][data[new_regions] != 0].count()).transpose()),color="lightblue", errorbar="sd")
fig4.set_title("Number of Houses per Price Category")
fig4.set_ylabel("Count")
fig4.set_xlabel("Price Category")
plt.show()


In [None]:
# label encoding
for t,i in enumerate(new_regions):
    data[i] = np.where(data[i] == 0,0,t+1)

data["price_cat_encoded"] = data[new_regions].sum(axis=1)
data = data.drop(new_regions, axis=1)

In [None]:
# creating encoding pattern for Bin Sequence Encoding enabling to properly encode possible future data points for predictions
encode_pattern_var2 = data.copy()

region_labels_collection = {i: [temp_frame[i].unique()] for i in new_regions}

for i in region_labels_collection:
    encode_pattern_var2.loc[encode_pattern_var2["Region"].isin(*region_labels_collection[i]) == True, i] = 1

encode_pattern_var2 = encode_pattern_var2.drop(["Price", "Area", "Room","price_cat_encoded"], axis=1).drop_duplicates(subset="Region")
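
The idea behind the Bin Sequence pattern (one row per zip-region, a 1 in every price bin where that region has at least one house) can also be sketched with `pd.crosstab`. This is an alternative formulation on made-up rows, assuming a hypothetical `price_bin` column that already holds the bin label of each house.

```python
import pandas as pd

# Made-up houses: zip-region plus the price bin each house falls into
houses = pd.DataFrame({
    "Region": [1011, 1011, 1011, 1055, 1055],
    "price_bin": ["300k_to_600k", "600k_to_900k", "600k_to_900k", "0_to_300k", "300k_to_600k"],
})

# Count houses per (region, bin) and turn every non-zero count into a 1
pattern = pd.crosstab(houses["Region"], houses["price_bin"]).gt(0).astype(int)
print(pattern)
```

Bins that never occur in the data do not appear as columns here, so a real pattern would need to be reindexed to the full list of bins.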

In [None]:
# creating Bin Sequence Encoding function enabling to properly encode also possible future data points for predictions
def EncodeRegionVar2(transformData: pd.DataFrame, mergeColumn: str, encodePattern: pd.DataFrame, encoderColumns: list) -> pd.DataFrame:
    transformData = pd.merge(transformData,encodePattern,how="left", on=mergeColumn)
    
    for i in ["Region","price_cat_encoded"]:
        if i in transformData.columns:
            transformData = transformData.drop(i, axis=1)
            
    transformData[encoderColumns] = transformData[encoderColumns].fillna(0)
    return transformData

#### Label Encoding (Region by frequency)

In [None]:
# creating encoding pattern for Label Encoding enabling to properly encode possible future data points for predictions
encode_pattern_var3 = data.copy()

encode_pattern_var3 = [(encode_pattern_var3
                    .query("Region == @i")
                    .groupby("price_cat_encoded")
                    .count()
                    .idxmax()["Region"]) 
                    
                    for i in encode_pattern_var3["Region"].unique()]

encode_pattern_var3 = pd.DataFrame({"Region":list(data["Region"].unique()),"most_frequent_cat":encode_pattern_var3})
encode_pattern_var3["most_frequent_cat"] = encode_pattern_var3["most_frequent_cat"].astype("category")

# translate the numeric label to the actual label just for later plotting
temp_for_plot = encode_pattern_var3.copy()
temp_for_plot["most_frequent_cat"] = temp_for_plot["most_frequent_cat"].replace([1,2,3,4,5],new_regions)
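
Finding the most frequent price category per zip-region can also be written as a single `groupby` with `Series.mode`. The sketch below uses invented rows and is an alternative formulation of the idea, not the notebook's own code.

```python
import pandas as pd

# Made-up houses with an already label-encoded price category (1 = cheapest bin)
houses = pd.DataFrame({
    "Region": [1011, 1011, 1011, 1055, 1055, 1055],
    "price_cat_encoded": [3, 3, 2, 1, 2, 2],
})

# Most frequent category per region; Series.mode() is sorted, so ties resolve to the lowest label
most_frequent = (
    houses.groupby("Region")["price_cat_encoded"]
    .agg(lambda s: s.mode().iloc[0])
    .rename("most_frequent_cat")
    .reset_index()
)
print(most_frequent)
```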

In [None]:
# creating Label Encoding function enabling to properly encode also possible future data points for predictions
def EncodeRegionVar3(transformData: pd.DataFrame, NewColumnName: str, encodePattern: pd.DataFrame, encoderColumn: str) -> pd.DataFrame:
    # look-up table: zip-region -> most frequent price category
    mapping = encodePattern.set_index("Region")[encoderColumn].astype(int)
    # work on a copy so the frame passed in (e.g. X_train) is not modified
    transformData = transformData.drop(columns=[NewColumnName]).copy()
    transformData[NewColumnName] = transformData["Region"].map(mapping).astype(int)
    return transformData.drop(columns=["Region"])

Again we take a look at the new price-regions (bins) projected on a map of Amsterdam. We clearly see that in most areas of the city the majority of house prices fall between 300k and 600k. This is true for the more suburban as well as the central areas.

In [None]:
# plotting the newly created price-regions reflected by the most frequent price within each zip-region
fig2 = px.choropleth_mapbox(
    temp_for_plot,
    locations=temp_for_plot["Region"],
    geojson=geodata,
    color=temp_for_plot["most_frequent_cat"],
    zoom=9.7, 
    center = {"lat": 52.356157, "lon": 4.907736},
    category_orders={"most_frequent_cat":new_regions},
    color_discrete_sequence = ["#fce6aa","#f08e62","#c53a59","#781a60","#282828"],
    mapbox_style="carto-positron",
    hover_data=[temp_for_plot["Region"],temp_for_plot["most_frequent_cat"]],
    width=1000,
    height=500
)

fig2.update_layout(                      
    margin={"r":0,"t":0,"l":0,"b":0},
    legend_title_text="Most Frequent Category within a Region"

)

fig2.show()

### PART III: Modelling

Now we prepare the data by splitting with stratification (by price-region), standard scaling and applying the encoding functions.<br>
<br>
We will use a simple SGD regression as well as three tree-based ensembles (Random Forest, AdaBoost and XGBoost).<br>
<br>
Finally we will create an ensemble of all four to get our final predictions. We will assess the performance of the predictions with the R2 and MSE values. To get a more intuitive understanding of the error we will also look at the MAE.
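
To show what the three metrics measure, here is a small worked example with made-up prices, computed directly with NumPy instead of the scikit-learn helpers used below.

```python
import numpy as np

# Made-up true and predicted prices in euros
y_true = np.array([400_000.0, 650_000.0, 900_000.0, 1_200_000.0])
y_hat = np.array([450_000.0, 600_000.0, 1_000_000.0, 1_100_000.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)            # squared euros, dominated by large errors
mae = np.mean(np.abs(residuals))         # euros, average size of an error
# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MSE: {mse:.3g}  MAE: {mae:.0f}  R2: {r2:.3f}")
```

The MAE is in the same unit as the price, which is why it is the easiest of the three to interpret.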

In [None]:
# scaling
# note: the scaler is fitted on all rows before the split, so test-set statistics leak into the training features
scaler = StandardScaler()
data_scaled = data[["Price","Area","Room","Region","price_cat_encoded"]].copy()
X_to_scale = ["Area", "Room"]
data_scaled[X_to_scale] = scaler.fit_transform(data[X_to_scale])


In [None]:
# splitting
rseed = 10  # fixed integer seed; random.seed() returns None and cannot be used as random_state
random.seed(rseed)
X_train, x_test, Y_train, y_test = train_test_split(data_scaled[data_scaled.columns[1:]], 
                                                    data_scaled["Price"], 
                                                    test_size=0.15, 
                                                    train_size=0.85, 
                                                    random_state=rseed, 
                                                    shuffle=True, 
                                                    stratify=data["price_cat_encoded"])
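
A common variation of the scale-then-split order used above is to split first and fit the scaler on the training rows only, so that no information from the held-out rows enters the features. The sketch below shows that order on synthetic data; the arrays and the `strata` variable are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (area, rooms) and a coarse class used only for stratification
rng = np.random.default_rng(10)
X = np.column_stack([rng.uniform(30, 250, 200), rng.integers(1, 8, 200)])
strata = rng.integers(1, 4, 200)

X_tr, X_te = train_test_split(X, test_size=0.15, random_state=10, stratify=strata)

# Learn mean and standard deviation from the training rows only ...
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# ... and apply the same transformation to the held-out rows
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3), X_te_scaled.mean(axis=0).round(3))
```

The training features end up with mean 0 by construction, while the test features are only approximately centred, which is the expected behaviour.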

In [None]:
# encoding variant 1: One-Hot encoding
X_train_1 = EncodeRegionVar1(X_train,encode_pattern_var1,"Region").drop(["price_cat_encoded"], axis=1)
x_test_1 = EncodeRegionVar1(x_test,encode_pattern_var1,"Region").drop(["price_cat_encoded"], axis=1)

In [None]:
# encoding variant 2: Bin Sequence encoding
X_train_2 = EncodeRegionVar2(X_train,"Region",encode_pattern_var2,  new_regions)
x_test_2 = EncodeRegionVar2(x_test,"Region",encode_pattern_var2,  new_regions)


In [None]:
# encoding variant 3: label encoding
X_train_3 = EncodeRegionVar3(X_train,"price_cat_encoded",encode_pattern_var3,"most_frequent_cat")
x_test_3 = EncodeRegionVar3(x_test,"price_cat_encoded",encode_pattern_var3,"most_frequent_cat")


In [None]:
# SGD modelling
param_grid_SGD = {"loss" : ["squared_error"],
              "alpha" : [0.0001, 0.001, 0.01, 0.1, 0.2],
              # None (not the string "none") switches the penalty off in current scikit-learn versions
              "penalty" : ["l2", "l1", "elasticnet", None],
              "eta0" : [0.0001,0.001,0.01,0.1, 0.2, 0.3, 0.4, 0.5],
                "max_iter" : [2000,10000,20000]
             }

model_SGD = GridSearchCV(SGDRegressor(random_state=rseed), param_grid_SGD ,cv=10, verbose=1, n_jobs=-1)

model_SGD_mono = model_SGD.fit(X_train_1,Y_train)
print(model_SGD_mono.best_estimator_)


y_pred = model_SGD_mono.predict(x_test_1)

print("R2: ", r2_score(y_test,y_pred))
print("MSE: ", mean_squared_error(y_test,y_pred))
print("MAE: ", mean_absolute_error(y_test,y_pred))

In [None]:
# RandomForest modelling
param_grid_RanFor = {"criterion" : ["squared_error","absolute_error"],
              "min_samples_split" : [2,3,4,5,6],
              "min_samples_leaf" : [2,3,4,5,6],
              "min_impurity_decrease" : [0,0.01,0.25,0.5]
             }

model_RanFor = GridSearchCV(RandomForestRegressor(random_state=rseed), param_grid_RanFor, cv=20, verbose = 1, n_jobs=-1)

model_RanFor_mono = model_RanFor.fit(X_train_1,Y_train)
print(model_RanFor_mono.best_estimator_)


y_pred = model_RanFor_mono.predict(x_test_1)

print("R2: ", r2_score(y_test,y_pred))
print("MSE: ", mean_squared_error(y_test,y_pred))
print("MAE: ", mean_absolute_error(y_test,y_pred))

In [None]:
# AdaBoost modelling
param_grid_Ada = {"loss" : ["linear","square", "exponential"],
              "n_estimators" : [50,100,500,1000],
              "learning_rate" : [0.1,0.01,0.001,0.0001]
             }

# the default base learner is a depth-3 decision tree, so no base estimator needs to be passed
model_Ada = GridSearchCV(AdaBoostRegressor(random_state=rseed), param_grid_Ada, cv=20, verbose=1, n_jobs=-1)

model_Ada_mono = model_Ada.fit(X_train_1,Y_train)
print(model_Ada_mono.best_estimator_)


y_pred = model_Ada_mono.predict(x_test_1)

print("R2: ", r2_score(y_test,y_pred))
print("MSE: ", mean_squared_error(y_test,y_pred))
print("MAE: ", mean_absolute_error(y_test,y_pred))

In [None]:
# XGBoost modelling
param_grid_XGB = {"n_estimators" : [500,750,1000,2000,5000],
              "learning_rate" : [0.01,0.001,0.0001],
              "max_depth" : [1,2,3,4],
              "booster" : ["gbtree", "gblinear"]
             }

model_XGB = GridSearchCV(xgb.XGBRegressor(random_state=rseed), param_grid_XGB, cv=20, verbose=1, n_jobs=-1)

model_XGB_mono = model_XGB.fit(X_train_1,Y_train)
print(model_XGB_mono.best_estimator_)


y_pred = model_XGB_mono.predict(x_test_1)

print("R2: ", r2_score(y_test,y_pred))
print("MSE: ", mean_squared_error(y_test,y_pred))
print("MAE: ", mean_absolute_error(y_test,y_pred))

In [None]:
%%time
# Voting Regressor Ensemble of all 4 model types - One-Hot Encoded - Variant 1
model_Vote_1 = VotingRegressor([('model_SGD', model_SGD), ('model_RanFor', model_RanFor), ('model_Ada', model_Ada), ('model_XGB', model_XGB)])

model_Vote_1.fit(X_train_1,Y_train)

y_pred_1 = model_Vote_1.predict(x_test_1)

print("---One-Hot-Encoding---")
print("R2: ", r2_score(y_test,y_pred_1))
print("MSE: ", mean_squared_error(y_test,y_pred_1))
print("MAE: ", mean_absolute_error(y_test,y_pred_1))

In [None]:
%%time
# Voting Regressor Ensemble of all 4 model types - Bin Sequence Encoded - Variant 2
model_Vote_2 = VotingRegressor([('model_SGD', model_SGD), ('model_RanFor', model_RanFor), ('model_Ada', model_Ada), ('model_XGB', model_XGB)])

model_Vote_2.fit(X_train_2,Y_train)

y_pred_2 = model_Vote_2.predict(x_test_2)

print("---Sequence Encoded---")
print("R2: ", r2_score(y_test,y_pred_2))
print("MSE: ", mean_squared_error(y_test,y_pred_2))
print("MAE: ", mean_absolute_error(y_test,y_pred_2))

In [None]:
%%time
# Voting Regressor Ensemble of all 4 model types - Label Encoded - Variant 3
model_Vote_3 = VotingRegressor([('model_SGD', model_SGD), ('model_RanFor', model_RanFor), ('model_Ada', model_Ada), ('model_XGB', model_XGB)])

model_Vote_3.fit(X_train_3,Y_train)

y_pred_3 = model_Vote_3.predict(x_test_3)

print("---Label Encoded---")
print("R2: ", r2_score(y_test,y_pred_3))
print("MSE: ", mean_squared_error(y_test,y_pred_3))
print("MAE: ", mean_absolute_error(y_test,y_pred_3))

In [None]:
# persist the three fitted ensembles and the fitted One-Hot encoder; the with-block closes each file after writing
artifacts = {"one_hot_enc_model.sav": model_Vote_1,
             "one_hot_enc.sav": enc,
             "sequence_enc_model.sav": model_Vote_2,
             "lable_enc_model.sav": model_Vote_3}

for filename, artifact in artifacts.items():
    with open(filename, 'wb') as f:
        pickle.dump(artifact, f)
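
Loading a saved object back works the same way in reverse. The sketch below does a full round trip with a plain dictionary as a stand-in for a fitted model; the file name `example_artifact.sav` is hypothetical and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model or encoder; any picklable Python object works the same way
artifact = {"bins": [0, 300_000, 600_000], "labels": ["low", "mid"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_artifact.sav")  # hypothetical file name
    # "with" closes the file handle even if dump or load raises
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Pickle files should only be loaded from trusted sources, and fitted models are best loaded with the same library versions they were saved with.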

#### <b>Final Thoughts</b>
We have seen that the features of the data set, although few, are good predictors of the house price. However, the imbalance of the data set (specifically of the categorical feature) is a critical issue that should be stressed in a real scenario. Since the data set is so small we do not have much wiggle room to combat this issue.<br>
The performance of the models is not the worst. The Sequence Encoding of the zip-regions performs a little better. However, it is unlikely that somebody in real life would be happy with an error margin of about 130000. Additionally, although we use a grid search with K-fold cross-validation (10 or 20 folds), the performance might vary with every training run. The same applies to the data splitting: the representation of the data might vary strongly from one split to the next. To combat this we could go on to create and train on different sets of splits; however, given the purpose of this notebook and the low potential gain, we conclude this exercise at this point.
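
How much a score depends on the split can be estimated with repeated K-fold cross-validation. The following is a sketch on synthetic data with a plain linear regression as a stand-in model; the numbers it prints say nothing about the models trained in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data: price grows with area plus noise (not the notebook's data)
rng = np.random.default_rng(0)
area = rng.uniform(30, 250, 200).reshape(-1, 1)
price = 5_000 * area.ravel() + rng.normal(0, 100_000, 200)

# 5 folds repeated 10 times with different shuffles = 50 scores
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = -cross_val_score(LinearRegression(), area, price, cv=cv,
                          scoring="neg_mean_absolute_error")

print(f"MAE: {scores.mean():.0f} +/- {scores.std():.0f} over {len(scores)} splits")
```

The spread of the scores gives a rough idea of how much of a difference between two models could be explained by the split alone.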