<div>
<img src="https://drive.google.com/uc?export=view&id=1vK33e_EqaHgBHcbRV_m38hx6IkG0blK_" width="350"/>
</div> 

# **Artificial Intelligence - MSc**
## ET5003 - MACHINE LEARNING APPLICATIONS

### Instructor: Enrique Naredo
### ET5003_Etivity-2

In [300]:
#@title Current Date
Today = '2021-08-22' #@param {type:"date"}


In [301]:
#@markdown ---
#@markdown ### Enter your details here:
Student_ID = "20214537" #@param {type:"string"}
Student_full_name = "Tom Keane" #@param {type:"string"}
#@markdown ---

In [302]:
#@title Notebook information
Notebook_type = 'Example' #@param ["Example", "Lab", "Practice", "Etivity", "Assignment", "Exam"]
Version = 'Draft' #@param ["Draft", "Final"] {type:"raw"}
Submission = False #@param {type:"boolean"}

# INTRODUCTION

**Piecewise regression**, extract from [Wikipedia](https://en.wikipedia.org/wiki/Segmented_regression):

Segmented regression, also known as piecewise regression or broken-stick regression, is a method in regression analysis in which the independent variable is partitioned into intervals and a separate line segment is fit to each interval. 

* Segmented regression analysis can also be performed on multivariate data by partitioning the various independent variables. 
* Segmented regression is useful when the independent variables, clustered into different groups, exhibit different relationships between the variables in these regions. 

* The boundaries between the segments are breakpoints.

* Segmented linear regression is segmented regression whereby the relations in the intervals are obtained by linear regression. 
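The bullets above can be illustrated with a minimal sketch of segmented linear regression, assuming a single independent variable and a breakpoint that is known in advance. The helper names `fit_segments` and `predict_segments`, the synthetic data and the breakpoint at 5.0 are all hypothetical and are not part of the assignment.

```python
import numpy as np

def fit_segments(x, y, breakpoints):
    """Fit one least-squares line per interval defined by sorted breakpoints."""
    edges = [-np.inf, *breakpoints, np.inf]
    segments = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x < hi)
        # np.polyfit with deg=1 returns (slope, intercept)
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        segments.append((lo, hi, slope, intercept))
    return segments

def predict_segments(x, segments):
    """Evaluate each point with the line of the interval it falls into."""
    y_hat = np.empty_like(x, dtype=float)
    for lo, hi, slope, intercept in segments:
        mask = (x >= lo) & (x < hi)
        y_hat[mask] = slope * x[mask] + intercept
    return y_hat

# synthetic "broken-stick" data: slope 2 below x = 5, slope -1 above it
rng = np.random.default_rng(11)
x = np.linspace(0.0, 10.0, 200)
y_true = np.where(x < 5.0, 2.0 * x, 10.0 - 1.0 * (x - 5.0))
y = y_true + rng.normal(scale=0.1, size=x.size)

segments = fit_segments(x, y, breakpoints=[5.0])
y_hat = predict_segments(x, segments)
for lo, hi, slope, intercept in segments:
    print(f"[{lo}, {hi}): slope={slope:.2f}, intercept={intercept:.2f}")
```

In the rest of the notebook the intervals are not fixed by hand; the intended approach is to obtain the groups by clustering and to fit one regression per group.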

***The goal is to use advanced Machine Learning methods to predict house prices.***

## Imports

In [303]:
# Suppressing Warnings:
import warnings
warnings.filterwarnings("ignore")

In [304]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import arviz as az
from sklearn.preprocessing import StandardScaler

In [305]:
# to plot
import matplotlib.colors
from mpl_toolkits.mplot3d import Axes3D

# to generate classification, regression and clustering datasets
import sklearn.datasets as dt

# to create data frames
from pandas import DataFrame

# to generate data from an existing dataset
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

In [306]:
# Define the seed so that results can be reproduced
seed = 11
rand_state = 11

# Define the color maps for plots
color_map = plt.get_cmap('RdYlBu')
color_map_discrete = matplotlib.colors.LinearSegmentedColormap.from_list("", ["red","cyan","magenta","blue"])

# DATASET

Extract from this [paper](https://ieeexplore.ieee.org/document/9300074):

* House prices are a significant impression of the economy, and its value ranges are of great concerns for the clients and property dealers. 

* Housing price escalate every year that eventually reinforced the need of strategy or technique that could predict house prices in future. 

* There are certain factors that influence house prices including physical conditions, locations, number of bedrooms and others.


1. [Download the dataset](https://github.com/UL-ET5003/ET5003_SEM1_2021-2/tree/main/Week-3). 

2. Upload the dataset into your folder.



The challenge is to predict the final price of each house.

## Training & Test Data

In [307]:
def import_datasets(git_link = 'https://raw.githubusercontent.com/tomkeane07/AI-Projects-UL/main/semester3/MachineLearningApplications/PiecewiseRegression'):
  return {
      'house_test' : pd.read_csv(git_link+'/house_test.csv'),
      'house_train' : pd.read_csv(git_link+'/house_train.csv'),
      'true_price'  : pd.read_csv(git_link+'/true_price.csv')}

# load the datasets only once per session; reuse dbs if it already exists
try:
  dbs
except NameError:
  dbs = import_datasets()

# split data into training and test
from sklearn.model_selection import train_test_split

# training: 70% (0.7), test: 30% (0.3) 
# you could try any other combination 
# but consider 50% of training as the low boundary
# X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.3)
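The commented-out split above can be sketched as follows. The arrays `X` and `y` are small synthetic stand-ins (an assumption, since the notebook does not define them at this point), and `random_state=11` simply mirrors the seed set earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-in data: 20 rows, 2 feature columns
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

# training: 70%, test: 30%; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=11
)
print(X_train.shape, X_test.shape)
```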

### Train dataset

In [308]:
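# NOTE: as written, dftrain is bound to 'house_test' (500 rows, no price column)
# and dftest to 'house_train' (2982 rows, with a price column)
dftrain = dbs['house_test']
dftest = dbs['house_train']
dfcost = dbs['true_price']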

In [309]:
# show first data frame rows 
dftrain.head()

Unnamed: 0,ad_id,bathrooms,beds,ber_classification,county,description_block,environment,features,latitude,longitude,no_of_units,property_category,property_type,surface,feature_line_count
0,12373510,2.0,4.0,G,Dublin,"It's all in the name ""Island View"";. Truly won...",prod,Breath-taking panoramic views radiate from thi...,53.566881,-6.101148,,sale,bungalow,142.0,3
1,12422623,2.0,3.0,C1,Dublin,REA McDonald - Lucan' s longest established es...,prod,Gas fired central heating.\nDouble glazed wind...,53.362992,-6.452909,,sale,terraced,114.0,6
2,12377408,3.0,4.0,B3,Dublin,REA Grimes are proud to present to the market ...,prod,Pristine condition throughout\nHighly sought-a...,53.454198,-6.262964,,sale,semi-detached,172.0,10
3,12420093,4.0,3.0,A3,Dublin,"REA McDonald, Lucan' s longest established est...",prod,A-rated home within a short walk of Lucan Vill...,53.354402,-6.458647,,sale,semi-detached,132.4,8
4,12417338,1.0,3.0,E2,Dublin,"Hibernian Auctioneers are delighted to bring, ...",prod,Mature Location \nGas Heating \nClose to Ameni...,53.33653,-6.393587,,sale,semi-detached,88.0,7


In [310]:
# Generate descriptive statistics
dftrain.describe()

| | ad_id | bathrooms | beds | latitude | longitude | no_of_units | surface | feature_line_count |
|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 0.0 | 500.0 | 500.0 |
| mean | 12316950.0 | 1.994 | 2.93 | 53.356034 | -6.247842 | NaN | 156.007671 | 7.068 |
| std | 148583.2 | 1.106532 | 1.191612 | 0.081905 | 0.088552 | NaN | 344.497362 | 2.210237 |
| min | 11306150.0 | 0.0 | 0.0 | 53.221348 | -6.496987 | NaN | 33.5 | 1.0 |
| 25% | 12286170.0 | 1.0 | 2.0 | 53.297373 | -6.296404 | NaN | 72.375 | 5.0 |
| 50% | 12379640.0 | 2.0 | 3.0 | 53.339547 | -6.243572 | NaN | 98.0 | 7.0 |
| 75% | 12405440.0 | 3.0 | 4.0 | 53.38165 | -6.185055 | NaN | 138.935 | 9.0 |
| max | 12428090.0 | 8.0 | 7.0 | 53.619775 | -6.064874 | NaN | 5746.53612 | 10.0 |


### Test dataset

In [311]:
# show first data frame rows 
dftest.head()

Unnamed: 0,ad_id,bathrooms,beds,ber_classification,county,description_block,environment,features,latitude,longitude,no_of_units,price,property_category,property_type,surface
0,996887,,,,Dublin,A SELECTION OF 4 AND 5 BEDROOM FAMILY HOMES LO...,prod,,53.418216,-6.149329,18.0,,new_development_parent,,
1,999327,,,,Dublin,**Last 2 remaining houses for sale ***\n\nOn v...,prod,,53.364917,-6.454935,3.0,,new_development_parent,,
2,999559,,,,Dublin,Final 4 &amp; 5 Bedroom Homes for Sale\n\nOn V...,prod,,53.273447,-6.313821,3.0,,new_development_parent,,
3,9102986,,,,Dublin,"Glenveagh Taylor Hill, Balbriggan\n\n*Ideal st...",prod,,53.608167,-6.210914,30.0,,new_development_parent,,
4,9106028,,,,Dublin,*New phase launching this weekend Sat &amp; Su...,prod,,53.262531,-6.181527,8.0,,new_development_parent,,


In [312]:
# Generate descriptive statistics
dftest.describe()

| | ad_id | bathrooms | beds | latitude | longitude | no_of_units | price | surface |
|---|---|---|---|---|---|---|---|---|
| count | 2982.0 | 2931.0 | 2931.0 | 2982.0 | 2982.0 | 59.0 | 2892.0 | 2431.0 |
| mean | 12240650.0 | 1.998635 | 2.979188 | 53.355991 | -6.257175 | 7.440678 | 532353.6 | 318.851787 |
| std | 579303.7 | 1.291875 | 1.468408 | 0.086748 | 0.141906 | 8.937081 | 567814.8 | 4389.423136 |
| min | 996887.0 | 0.0 | 0.0 | 51.458439 | -6.521183 | 0.0 | 19995.0 | 3.4 |
| 25% | 12268130.0 | 1.0 | 2.0 | 53.298929 | -6.314064 | 2.0 | 280000.0 | 74.1 |
| 50% | 12377580.0 | 2.0 | 3.0 | 53.345497 | -6.252254 | 3.0 | 380000.0 | 100.0 |
| 75% | 12402940.0 | 3.0 | 4.0 | 53.388845 | -6.196049 | 8.0 | 575000.0 | 142.0 |
| max | 12428360.0 | 18.0 | 27.0 | 53.630588 | -1.744995 | 36.0 | 9995000.0 | 182108.539008 |


### Expected Cost dataset

In [313]:
# Generate descriptive statistics
dfcost.describe()

| | Id | Expected |
|---|---|---|
| count | 500.0 | 500.0 |
| mean | 12316950.0 | 581035.6 |
| std | 148583.2 | 600919.4 |
| min | 11306150.0 | 85000.0 |
| 25% | 12286170.0 | 295000.0 |
| 50% | 12379640.0 | 425000.0 |
| 75% | 12405440.0 | 595000.0 |
| max | 12428090.0 | 5750000.0 |


In [314]:
# one last look at dataset
dftrain.head()

Unnamed: 0,ad_id,bathrooms,beds,ber_classification,county,description_block,environment,features,latitude,longitude,no_of_units,property_category,property_type,surface,feature_line_count
0,12373510,2.0,4.0,G,Dublin,"It's all in the name ""Island View"";. Truly won...",prod,Breath-taking panoramic views radiate from thi...,53.566881,-6.101148,,sale,bungalow,142.0,3
1,12422623,2.0,3.0,C1,Dublin,REA McDonald - Lucan' s longest established es...,prod,Gas fired central heating.\nDouble glazed wind...,53.362992,-6.452909,,sale,terraced,114.0,6
2,12377408,3.0,4.0,B3,Dublin,REA Grimes are proud to present to the market ...,prod,Pristine condition throughout\nHighly sought-a...,53.454198,-6.262964,,sale,semi-detached,172.0,10
3,12420093,4.0,3.0,A3,Dublin,"REA McDonald, Lucan' s longest established est...",prod,A-rated home within a short walk of Lucan Vill...,53.354402,-6.458647,,sale,semi-detached,132.4,8
4,12417338,1.0,3.0,E2,Dublin,"Hibernian Auctioneers are delighted to bring, ...",prod,Mature Location \nGas Heating \nClose to Ameni...,53.33653,-6.393587,,sale,semi-detached,88.0,7


### Data Encoding

In [315]:
# drop columns that are not used, if they are present
dftrain.drop(columns=['facility', 'area'], errors='ignore', inplace=True)
dftest.drop(columns=['facility', 'area'], errors='ignore', inplace=True)
# inspect the raw 'features' strings
print(list(dftrain['features']))

['Breath-taking panoramic views radiate from this waterside property\nDetached 4 bed bungalow with attic conversion, on c.0.66 acre elevated site\nLarge kitchen Diner to rear benefiting from stunning island views\n', "Gas fired central heating.\nDouble glazed windows.\nRear garden (8.92m long) with cobble lock patio, lawn and timber shed.\nLocated within St Mary's Parish.\nDesignated car space.\nNestled away in a quiet cul de sac location.\n", 'Pristine condition throughout\nHighly sought-after residential development\nB3 energy rating\nEnviable position within the estate\nFully alarmed\nGFCH heating\nPrivate low maintenance rear garden\nLarge extension\nExcellent school and sports facilities\nShuttle bus service to Swords village\n', 'A-rated home within a short walk of Lucan Village\nSpacious room layout with accommodation over 3 levels\nPassive triple-glazed windows and patio door\nGas fired central heating system\n10 year Homebond Structural Guarantee\nIntruder alarm system fitted\

Having looked at the 'features' column, I surmise that each feature ends with a newline character ('\n').

In [316]:
dftrain['feature_line_count'] = dftrain['features'].apply(lambda x: x.count('\n'))
dftrain['feature_line_count']

0       3
1       6
2      10
3       8
4       7
       ..
495    10
496    10
497    10
498     6
499     8
Name: feature_line_count, Length: 500, dtype: int64

The free text in the 'features' column may not be of much use to me directly; however, it is plausible that a longer entry indicates more features. The number of features a property offers may be relevant information. For this reason, I am going to include the number of feature lines as the 'feature_line_count' column.

# PIECEWISE REGRESSION

## Full Model

In [317]:
# select some features columns just for the baseline model
# assume not all of the features are informative or useful
# in this exercise you could try all of them if possible

featrain = ['latitude', 'longitude', 'bathrooms', 'beds', 'surface']
# dropna: remove rows with missing values
df_subset_train = dftrain[featrain].dropna(axis=0)

# assumption: the test set uses the same columns as the training set
featest = featrain
df_subset_test = dftest[featest].dropna(axis=0)

# cost: keep only the rows whose index survives in the test subset
df_cost = dfcost[dfcost.index.isin(df_subset_test.index)]

In [None]:
# model
with pm.Model() as model:
    #prior over the parameters of linear regression
    alpha = pm.Normal('alpha', mu=0, sigma=30)
    #we have one beta for each column of Xn
    beta = pm.Normal('beta', mu=0, sigma=30, shape=Xn_train.shape[1])
    #prior over the standard deviation of the noise
    sigma = pm.HalfCauchy('sigma_n', 5)
    #linear regression model in matrix form
    mu = alpha + pm.math.dot(beta, Xn_train.T)
    #likelihood, be sure that observed is a 1d vector
    like = pm.Normal('like', mu=mu, sigma=sigma, observed=yn_train[:,0])
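The model above expects standardized arrays `Xn_train` and `yn_train`, which the cells shown here do not define. Below is a minimal sketch of one way they could be built, assuming the target is log-transformed before scaling (the later call `np.exp(yscaler.inverse_transform(...))` suggests this). The synthetic `X_train` and `y_train` and the name `xscaler` are hypothetical placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical stand-ins for the selected feature columns and the prices
rng = np.random.default_rng(11)
X_train = rng.normal(size=(100, 5))
y_train = np.exp(rng.normal(loc=13.0, scale=0.5, size=100))

# standardize the features: zero mean and unit variance per column
xscaler = StandardScaler().fit(X_train)
Xn_train = xscaler.transform(X_train)

# log-transform the target, then standardize it (kept as a column vector)
log_y = np.log(y_train).reshape(-1, 1)
yscaler = StandardScaler().fit(log_y)
yn_train = yscaler.transform(log_y)

# round trip: undo the scaling and the log to get back to prices
y_back = np.exp(yscaler.inverse_transform(yn_train))[:, 0]
print(Xn_train.shape, yn_train.shape)
```

Keeping `yn_train` two-dimensional matches the `yn_train[:,0]` indexing used in the likelihood, and the test features would be transformed with the scaler fitted on the training data.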
    

In [None]:
# prediction from the posterior means of alpha and beta
ll = np.mean(posterior['alpha']) + np.dot(np.mean(posterior['beta'], axis=0), Xn_test.T)
# undo the standardization and the log transform of the target
y_pred_BLR = np.exp(yscaler.inverse_transform(ll.reshape(-1, 1)))[:, 0]
print("MAE = ", np.mean(np.abs(y_pred_BLR - y_test)))
print("MAPE = ", np.mean(np.abs(y_pred_BLR - y_test) / y_test))
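The prediction step above can be sketched in isolation with plain NumPy. Here the posterior samples are simulated arrays rather than the output of a fitted model, and the helper names `posterior_mean_predict`, `mae` and `mape` are hypothetical. The sketch only shows the shapes involved and how the two error metrics are computed.

```python
import numpy as np

def posterior_mean_predict(alpha_samples, beta_samples, X):
    """Point prediction from the posterior means of intercept and weights."""
    return np.mean(alpha_samples) + X @ np.mean(beta_samples, axis=0)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_pred - y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction of the true value."""
    return np.mean(np.abs(y_pred - y_true) / y_true)

# simulated posterior samples: 1000 draws of alpha and of a 2-dimensional beta
rng = np.random.default_rng(11)
alpha_samples = rng.normal(loc=1.0, scale=0.01, size=1000)
beta_samples = rng.normal(loc=[2.0, -1.0], scale=0.01, size=(1000, 2))

X_test = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_test = np.array([3.0, 0.5, 2.0])

y_pred = posterior_mean_predict(alpha_samples, beta_samples, X_test)
print("MAE = ", mae(y_test, y_pred))
print("MAPE = ", mape(y_test, y_pred))
```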

## Clustering

### Full Model

In [None]:
# training gaussian mixture model 
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=4, random_state=rand_state)
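A minimal sketch of how a fitted mixture could assign a cluster label to each training and test row. The two-dimensional synthetic blobs are hypothetical stand-ins for the scaled house features, and `n_init=5` is an assumption made to reduce the chance of a poor local optimum.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# hypothetical data: four well-separated blobs in two dimensions
rng = np.random.default_rng(11)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
X_train = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])
X_test = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])

# fit the mixture on the training rows only
gmm = GaussianMixture(n_components=4, n_init=5, random_state=11).fit(X_train)

# hard assignments: index of the most probable component for each row
clusters_train = gmm.predict(X_train)
clusters_test = gmm.predict(X_test)

for k in range(4):
    print(k, np.sum(clusters_train == k), np.sum(clusters_test == k))
```

Fitting on the training rows and only calling `predict` on the test rows keeps the test data out of the clustering step.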


### Clusters

In [None]:
# train clusters



In [None]:
# test clusters


## Piecewise Model

In [None]:
# model_0
with pm.Model() as model_0:
  # prior over the parameters of linear regression
  alpha = pm.Normal('alpha', mu=0, sigma=30)
  # we have a beta for each column of Xn0
  beta = pm.Normal('beta', mu=0, sigma=30, shape=Xn0.shape[1])
  # prior over the standard deviation of the noise
  sigma = pm.HalfCauchy('sigma_n', 5)
  # linear regression model in matrix form
  mu = alpha + pm.math.dot(beta, Xn0.T)
  # likelihood, be sure that observed is a 1d vector
  like = pm.Normal('like', mu=mu, sigma=sigma, observed=yn0[:,0])
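The piecewise idea is to fit one regression per cluster and to predict each row with the model of its own cluster. The sketch below illustrates that routing with scikit-learn's `LinearRegression` as a plain stand-in for the Bayesian model above; it is not the PyMC3 model itself. The helper names, the synthetic data and the two-cluster setup are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_per_cluster(X, y, clusters):
    """Fit one independent linear model on the rows of each cluster."""
    models = {}
    for k in np.unique(clusters):
        mask = clusters == k
        models[int(k)] = LinearRegression().fit(X[mask], y[mask])
    return models

def predict_per_cluster(models, X, clusters):
    """Predict every row with the model that belongs to its cluster."""
    y_hat = np.full(X.shape[0], np.nan)
    for k, model in models.items():
        mask = clusters == k
        if mask.any():
            y_hat[mask] = model.predict(X[mask])
    return y_hat

# synthetic data: two clusters with different linear relationships
rng = np.random.default_rng(11)
X = rng.normal(size=(200, 2))
clusters = rng.integers(0, 2, size=200)
coefs = np.array([[1.0, 2.0], [-3.0, 0.5]])
y = np.sum(X * coefs[clusters], axis=1) + rng.normal(scale=0.05, size=200)

models = fit_per_cluster(X, y, clusters)
y_hat = predict_per_cluster(models, X, clusters)
print({k: m.coef_.round(2) for k, m in models.items()})
```

Rows whose cluster has no fitted model are left as NaN, which makes any mismatch between training and test clusters easy to spot.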



## Simulations

### Only Cluster 0

## Overall

## Test set performance

### PPC on the Test set



# SUMMARY