<br>

<br>

### 3. Model Planning and Building

&nbsp;&nbsp;&nbsp;&nbsp; 3.1. Import Libraries <br>
&nbsp;&nbsp;&nbsp;&nbsp; 3.2. Define Dependent and Independent Variables <br>
&nbsp;&nbsp;&nbsp;&nbsp; 3.3. Define Train and Test Set <br>
&nbsp;&nbsp;&nbsp;&nbsp; 3.4. Model Building <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.4.1. Multiple Linear Regression <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.4.2. Decision Tree <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.4.3. Random Forest <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.4.4. Multi Layer Perceptron Model <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.4.5. KMeans Clustering <br>

#### 3.1. Import Libraries <br>

In [14]:
# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans

from sklearn.model_selection import train_test_split
# Metrics used by get_errors and by the R squared check in the model cells
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import make_classification

import math

#### Method Definition
Here we define helper functions for splitting the data, scoring a model, generating predictions, and computing error metrics, so that the model cells below stay short.

In [15]:
def split_dataset(cleaned_df, X, Y):
    # Split the dataset into train (80%) and test (20%) sets;
    # cleaned_df is accepted but not used by the split itself
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.2, random_state=100)

    return X_train, X_test, y_train, y_test

def get_accuracy_score(model, x_set, y_set):
    # model.score returns R squared for regressors and mean accuracy for classifiers
    return model.score(x_set, y_set)

def get_predictions(model, x_set):
    # Predict on the set passed in as x_set
    y_predictions = model.predict(x_set)
    return y_predictions

def get_errors(y_test, y_predict):
    # calculate the mean squared error
    model_mse = mean_squared_error(y_test, y_predict)
    # calculate the mean absolute error
    model_mae = mean_absolute_error(y_test, y_predict)
    # calculate the root mean squared error
    model_rmse = math.sqrt(model_mse)
    return model_mse, model_mae, model_rmse


#### 3.2. Define Dependent and Independent Variables <br>

In [16]:
X = cleaned_df.drop('Price', axis = 1)
Y = cleaned_df[['Price']]

print('Independent Variables: ')
display(X.head(5))
print('-'*100)
print('Dependent Variables: ')
display(Y.head(5))

Independent Variables: 


Unnamed: 0,Bedrooms,Bathrooms,Floor,Floors Area,Lot Area,Garage,Reservation Fee,Location_Quezon City,Location_Pasig City,Location_Muntinlupa City,Location_Marikina City,Location_Paranaque City,Location_Caloocan City,Location_Taguig City,Location_Manila City,Garden_yes,Pet_Friendly_yes
0,2.0,1.0,2.0,67.0,55.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,3.0,2.0,2.0,78.0,58.0,1.0,30000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,3.0,2.0,2.0,65.0,60.0,1.0,50000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,3.0,2.0,2.0,70.0,52.0,1.0,30000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
4,3.0,2.0,2.0,65.0,48.0,1.0,50000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0


----------------------------------------------------------------------------------------------------
Dependent Variables: 


Unnamed: 0,Price
0,3900001.0
1,4987000.0
2,4200000.0
3,4480000.0
4,3800000.0


#### 3.3. Define Train and Test Set <br>

In [17]:
X_train, X_test, y_train, y_test = split_dataset(cleaned_df, X, Y) 
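
To make the 80/20 split concrete, here is a minimal sketch on hypothetical toy arrays (`X_toy` and `y_toy` are illustrative names, not part of this notebook's data). It uses the same `test_size` and `random_state` values as `split_dataset`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, one target value per row
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same settings as split_dataset: 20% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the split reproducible, so the scores reported later do not change from run to run.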

#### 3.4. Model Building <br>

#### 3.4.1. Multiple Linear Regression <br>

#### Accuracy Test
Quantifying how well the model fits is an important step in justifying its use.<br> <br> For a regressor, scikit-learn's `score` method returns the coefficient of determination (R squared) rather than a classification accuracy. As shown below, the linear regression model scores 93.78% on the train set and 84.45% on the test set. In other words, it explains about 94% of the variance in price on the data it was fitted to and about 84% on unseen data. The gap between the two suggests some overfitting.

In [18]:
# create a Linear Regression model object
regression_model = LinearRegression()

# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)

accuracy_train = get_accuracy_score(regression_model, X_train, y_train)
accuracy_test = get_accuracy_score(regression_model, X_test, y_test)

print('Accuracy Train: {0:.2f}%'.format(accuracy_train * 100))
print('Accuracy Test: {0:.2f}%'.format(accuracy_test * 100))

Accuracy Train: 93.78%
Accuracy Test: 84.45%
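
The following sketch illustrates what the score above measures. It assumes synthetic data (the names `X_toy`, `y_toy`, and the generating formula are made up for illustration) and checks that `LinearRegression.score` agrees with `r2_score` computed from the model's own predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical synthetic data: a linear signal plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X_toy, y_toy)

# score() on a regressor and r2_score on its predictions give the same number
score = model.score(X_toy, y_toy)
r2 = r2_score(y_toy, model.predict(X_toy))
print(np.isclose(score, r2))
```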


#### Predictions
Generating predictions on the held-out test set shows how the model behaves on data it has not seen. <br><br>As shown below, the first five rows of the test set receive predicted prices of about P4,355,528, P4,474,801, P3,926,629, P3,835,383, and P5,240,095. In addition, the model has an R squared of 0.84 on the test set, which matches the test score above and suggests a reasonably good fit to our data.

In [19]:
# Get multiple predictions
y_predictions= get_predictions(regression_model, X_test)

# Show the first 5 predictions
print('-'*100)
print('Predictions:')
display(y_predictions[:5])

model_r2 = r2_score(y_test, y_predictions)
print("R Squared: {:.2}".format(model_r2))

----------------------------------------------------------------------------------------------------
Predictions:


array([[4355528.33250099],
       [4474800.65023242],
       [3926629.15459845],
       [3835382.94510465],
       [5240095.34950777]])

R Squared: 0.84


#### Prediction Errors
One of the simplest ways to assess a regression model is to look at the error between predicted and actual values. The mean squared error (MSE) is the average squared difference between the predicted and actual values. The mean absolute error (MAE) is the average size of the error, ignoring direction. The root mean squared error (RMSE) is the square root of the MSE; it is expressed in the same units as the price and penalizes large errors more heavily than the MAE. <br><br>RMSE is always at least as large as MAE, and the gap widens as the errors become more uneven. As shown below, the RMSE (about P649,758) is roughly 1.4 times the MAE (about P470,159). This indicates some variation in the size of the errors, but it does not suggest that a few extreme errors dominate.

In [20]:
model_mse, model_mae, model_rmse = get_errors(y_test, y_predictions)

print("Multiple Regression Prediction Errors: ")
print("Mean Squared Error {0}".format(model_mse))
print("Mean Absolute Error {0}".format(model_mae))
print("Root Mean Squared Error {0}".format(model_rmse))

Multiple Regression Prediction Errors: 
Mean Squared Error 422186027888.23083
Mean Absolute Error 470159.0225516364
Root Mean Squared Error 649758.4381046782
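
As a small worked example of how the gap between RMSE and MAE reacts to one large error, consider two hypothetical sets of errors with the same MAE. The helper `mae_rmse` and the error values are illustrative only.

```python
import math
import numpy as np

def mae_rmse(errors):
    # errors = predicted value minus actual value
    errors = np.asarray(errors, dtype=float)
    mae = np.mean(np.abs(errors))
    rmse = math.sqrt(np.mean(errors ** 2))
    return mae, rmse

even = mae_rmse([10, -10, 10, -10])   # every error has the same size
uneven = mae_rmse([1, -1, 1, -37])    # same MAE, but one large error

print(even)    # MAE 10, RMSE 10
print(uneven)  # MAE 10, RMSE about 18.5
```

When all errors have the same size, RMSE equals MAE. A single large error leaves the MAE unchanged here but almost doubles the RMSE.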


#### Intercept and Coefficient 
The intercept is the predicted value of the dependent variable when every independent variable is zero, that is, the point where the fitted line crosses the y-axis. Each coefficient gives the direction and size of the relationship between a predictor variable and the response variable. A positive value indicates that as a specific independent variable increases while the others remain constant, the dependent variable also increases.
A negative value indicates that as a specific independent variable increases while the others remain constant, the dependent variable decreases.

In [21]:
# Grab the intercept of our model
intercept = regression_model.intercept_[0]

print("The intercept for our model is {:.4}".format(intercept))
print('-'*100)

# Pair each column name with its coefficient and print it
for name, value in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {:.10}".format(name, value))

The intercept for our model is 2.171e+06
----------------------------------------------------------------------------------------------------
The Coefficient for Bedrooms is 204526.1815
The Coefficient for Bathrooms is 174769.1812
The Coefficient for Floor is -79499.16129
The Coefficient for Floors Area is 12490.35098
The Coefficient for Lot Area is 14278.30712
The Coefficient for Garage is -225483.4805
The Coefficient for Reservation Fee is -0.9879470662
The Coefficient for Location_Quezon City is -196914.193
The Coefficient for Location_Pasig City is -560477.8682
The Coefficient for Location_Muntinlupa City is -158825.7114
The Coefficient for Location_Marikina City is -121996.858
The Coefficient for Location_Paranaque City is -69118.4079
The Coefficient for Location_Caloocan City is -591451.4094
The Coefficient for Location_Taguig City is -754345.4031
The Coefficient for Location_Manila City is 689282.7817
The Coefficient for Garden_yes is -14877.30036
The Coefficient for Pet_Friendl
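
To illustrate how a coefficient is read, the sketch below fits a regression on hypothetical toy data generated exactly from `price = 100 + 20 * bedrooms - 5 * age` (the feature names and numbers are made up). Raising one feature by one unit while holding the other fixed changes the prediction by that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: columns are (bedrooms, age), with no noise
X_toy = np.array([[1, 10], [2, 5], [3, 20], [4, 1], [2, 8]], dtype=float)
y_toy = 100 + 20 * X_toy[:, 0] - 5 * X_toy[:, 1]

model = LinearRegression().fit(X_toy, y_toy)

# Same house, one extra bedroom, age held constant
base = np.array([[2.0, 10.0]])
more_bedrooms = np.array([[3.0, 10.0]])
change = model.predict(more_bedrooms)[0] - model.predict(base)[0]

print(model.intercept_, model.coef_, change)
```

The positive coefficient on the first feature raises the prediction, while the negative coefficient on the second would lower it by 5 for each additional unit.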

#### 3.4.2. Decision Tree

#### Gini Index and Entropy Accuracy Test
Quantifying the accuracy of the model is an important step in justifying its use. The Gini index, or Gini impurity, measures the probability that a randomly chosen sample would be wrongly classified if it were labeled according to the class distribution in a node. Entropy measures how "mixed" the classes in a node are. In this case, we trained one tree with each criterion and measured the accuracy of both.<br><br> As shown below, both trees reach 97.89% accuracy on the train set, but only 26.67% (entropy) and 31.67% (Gini) on the test set. Because `DecisionTreeClassifier` treats every distinct price as its own class, a prediction counts as correct only when it matches the actual price exactly. Near-perfect train accuracy combined with low test accuracy indicates that the trees have largely memorized the training data.

In [22]:
# Decision trees with the entropy and Gini criteria
decision_entropy_model = DecisionTreeClassifier(criterion="entropy", random_state=100)
decision_gini_model = DecisionTreeClassifier(criterion="gini", random_state=100)

decision_entropy_model.fit(X_train, y_train)
decision_gini_model.fit(X_train, y_train) 

accuracy_train_entropy = get_accuracy_score(decision_entropy_model, X_train, y_train)
accuracy_test_entropy = get_accuracy_score(decision_entropy_model, X_test, y_test)
accuracy_train_gini = get_accuracy_score(decision_gini_model, X_train, y_train)
accuracy_test_gini = get_accuracy_score(decision_gini_model, X_test, y_test)

print('Entropy Accuracy Train: {0:.2f}%'.format(accuracy_train_entropy * 100))
print('Entropy Accuracy Test: {0:.2f}%'.format(accuracy_test_entropy * 100))
print('-'*100)
print('Gini Accuracy Train: {0:.2f}%'.format(accuracy_train_gini * 100))
print('Gini Accuracy Test: {0:.2f}%'.format(accuracy_test_gini * 100))

y_train_predictions_entropy = get_predictions(decision_entropy_model, X_train)
y_test_predictions_entropy = get_predictions(decision_entropy_model, X_test)
y_train_predictions_gini = get_predictions(decision_gini_model, X_train)
y_test_predictions_gini= get_predictions(decision_gini_model, X_test)

Entropy Accuracy Train: 97.89%
Entropy Accuracy Test: 26.67%
----------------------------------------------------------------------------------------------------
Gini Accuracy Train: 97.89%
Gini Accuracy Test: 31.67%
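
The two split criteria described in this section can be computed by hand from the class proportions in a node. The sketch below uses the standard definitions (Gini impurity is one minus the sum of squared proportions; entropy is the negative sum of each proportion times its base-2 logarithm) on hypothetical labels; the function names are illustrative.

```python
import numpy as np

def gini_impurity(labels):
    # 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # -sum(p_k * log2(p_k)) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

mixed = ["a", "a", "b", "b"]   # evenly mixed node
pure = ["a", "a", "a", "a"]    # node with a single class

print(gini_impurity(mixed), entropy(mixed))
print(gini_impurity(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why a tree prefers splits that lower them.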


#### Gini and Entropy Prediction Errors
As in the linear regression section, we compare the MSE, MAE, and RMSE of the test predictions. Unlike accuracy, these metrics measure how far each predicted price is from the actual price, so a prediction that is close but not exact is still rewarded. <br><br>As shown below, the MAE is about P286,638 for the Gini tree and about P246,998 for the entropy tree. In both cases the RMSE is roughly 1.8 times the MAE, a wider gap than for linear regression, which suggests that a few test predictions are far from the actual price.

In [23]:
model_mse, model_mae, model_rmse = get_errors(y_test, y_test_predictions_gini)

print("Decision Tree (Gini) Prediction Errors: ")
print("Mean Squared Error {0}".format(model_mse))
print("Mean Absolute Error {0}".format(model_mae))
print("Root Mean Squared Error {0}".format(model_rmse))

print('-'*100)

model_mse, model_mae, model_rmse = get_errors(y_test, y_test_predictions_entropy)
print("Decision Tree (Entropy) Prediction Errors: ")
print("Mean Squared Error {0}".format(model_mse))
print("Mean Absolute Error {0}".format(model_mae))
print("Root Mean Squared Error {0}".format(model_rmse))

Decision Tree (Gini) Prediction Errors: 
Mean Squared Error 269573796181.68332
Mean Absolute Error 286638.11666666664
Root Mean Squared Error 519204.96548249933
----------------------------------------------------------------------------------------------------
Decision Tree (Entropy) Prediction Errors: 
Mean Squared Error 205486024793.6
Mean Absolute Error 246998.1
Root Mean Squared Error 453305.6637563665


#### 3.4.3. Random Forest <br>

#### Accuracy Test
Random Forest combines many decision trees to obtain better predictive performance than a single tree. Each tree is trained on a bootstrap sample of the training data, and because we use `RandomForestRegressor`, the forest's prediction is the average of the individual regression trees' predictions.
<br><br> As with linear regression, `score` returns R squared for this model. As shown below, the model scores 99.45% on the train set and 94.51% on the test set, which is higher than the 84.45% test score of linear regression.

In [24]:
random_forest_model = RandomForestRegressor(n_estimators = 100, random_state = 0)
random_forest_model.fit(X_train, y_train.values.flatten())  

accuracy_train = get_accuracy_score(random_forest_model,X_train,y_train)
accuracy_test = get_accuracy_score(random_forest_model,X_test,y_test)

print('Accuracy Train: {0}%'.format(accuracy_train * 100))
print('Accuracy Test: {0}%'.format(accuracy_test * 100))

Accuracy Train: 99.45435402762185%
Accuracy Test: 94.51481948652155%
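
The averaging step can be checked directly. The sketch below assumes synthetic data (all names and the generating formula are illustrative) and compares the forest's prediction with the mean of the predictions of its individual trees, which scikit-learn exposes as `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data with a non-linear signal
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(80, 3))
y_toy = 2.0 * X_toy[:, 0] + X_toy[:, 1] ** 2 + rng.normal(0, 1, size=80)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# One row of predictions per tree, then average across the trees
per_tree = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_toy[:5])

print(np.allclose(manual_average, forest_prediction))
```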


#### Prediction Errors
As in the linear regression section, we compare the MSE, MAE, and RMSE of the test predictions. <br><br>As shown below, the MAE is about P236,762 and the RMSE about P385,851, both lower than the corresponding errors of linear regression. The RMSE is roughly 1.6 times the MAE, which indicates that the errors vary in size and that some test predictions miss by considerably more than the average.

In [25]:
# Get multiple predictions
y_predictions= get_predictions(random_forest_model, X_test)

model_mse, model_mae, model_rmse = get_errors(y_test, y_predictions)

print("Random Forest Prediction Errors: ")
print("Mean Squared Error {0}".format(model_mse))
print("Mean Absolute Error {0}".format(model_mae))
print("Root Mean Squared Error {0}".format(model_rmse))

Random Forest Prediction Errors: 
Mean Squared Error 148880943875.0694
Mean Absolute Error 236761.9166944445
Root Mean Squared Error 385850.93478579185


#### 3.4.4. Multi Layer Perceptron Model <br>

#### Accuracy Test
A multilayer perceptron is a feedforward neural network: its layers are connected in a directed graph, so the signal passes through the nodes in one direction only, from input to output.

As shown below, the model is not a good fit to our data: its accuracy is only 8.86% for the train set and 10% for the test set. As with the decision tree, `MLPClassifier` treats every distinct price as its own class, so a prediction counts as correct only when it matches the actual price exactly.

In [26]:
mlp_model = MLPClassifier(random_state=1, max_iter=300)
mlp_model.fit(X_train, y_train.values.flatten())

# accuracy percentage
accuracy_train = get_accuracy_score(mlp_model,X_train,y_train)
accuracy_test = get_accuracy_score(mlp_model,X_test,y_test)

print('Accuracy Train: {0}%'.format(accuracy_train * 100))
print('Accuracy Test: {0}%'.format(accuracy_test * 100))

Accuracy Train: 8.860759493670885%
Accuracy Test: 10.0%
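
To show what "the signal only goes one way" means in practice, here is a minimal sketch of a single forward pass through a tiny network with one hidden layer. The weights, biases, and input are made-up values, and ReLU is assumed as the hidden activation; this is an illustration of the idea, not the fitted `mlp_model`.

```python
import numpy as np

def relu(z):
    # ReLU activation: negative values become zero
    return np.maximum(z, 0.0)

# Hypothetical input with two features
x = np.array([1.0, 2.0])

# Hypothetical weights and biases: 2 inputs -> 2 hidden units -> 1 output
W1 = np.array([[1.0, -1.0],
               [0.5, 0.5]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[3.0], [4.0]])
b2 = np.array([1.0])

hidden = relu(x @ W1 + b1)   # input layer -> hidden layer
output = hidden @ W2 + b2    # hidden layer -> output layer

print(hidden, output)  # [2. 0.] [7.]
```

Each layer only consumes the output of the layer before it; nothing flows backwards during prediction.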


#### Prediction Errors
As in the linear regression section, we compare the MSE, MAE, and RMSE of the test predictions. <br><br>
As shown below, the errors are far larger than for the previous models: the MAE is about P2,301,333 and the RMSE about P3,047,733. On average, the predicted price is off by more than P2 million, which confirms that this model is a poor fit to our data.

In [27]:
# Get multiple predictions
y_predictions= get_predictions(mlp_model, X_test)

model_mse, model_mae, model_rmse = get_errors(y_test, y_predictions)

print("Multi Layer Perceptron Prediction Errors: ")
print("Mean Squared Error {0}".format(model_mse))
print("Mean Absolute Error {0}".format(model_mae))
print("Root Mean Squared Error {0}".format(model_rmse))

Multi Layer Perceptron Prediction Errors: 
Mean Squared Error 9288676982848.316
Mean Absolute Error 2301333.316666667
Root Mean Squared Error 3047733.0891743647


#### 3.4.5. KMeans Clustering <br>

#### Accuracy Test
The K-means algorithm picks k centroids and assigns every data point to the nearest one, adjusting the centroids to keep the within-cluster sum of squared distances as small as possible. It is an unsupervised method, so `fit` ignores `y_train`, and `score` returns the negative of that sum of squared distances on the given data rather than an accuracy.
<br><br>
As shown below, the values are large and negative: -1949802617711777.8 for the train set and -666073314398989.8 for the test set, after the cell multiplies the score by 100. They reflect the scale of the features and the number of rows, not prediction quality, so they cannot be read as percentages or compared with the scores of the other models.

In [28]:
# n_clusters=The number of clusters to form as well as the number of centroids to generate.
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train, y_train)

# accuracy percentage
accuracy_train = kmeans.score(X_train,y_train)
accuracy_test = kmeans.score(X_test,y_test)

print('Accuracy Train: {0}%'.format(accuracy_train * 100))
print('Accuracy Test: {0}%'.format(accuracy_test * 100))

Accuracy Train: -1949802617711777.8%
Accuracy Test: -666073314398989.8%
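
The meaning of the K-means score can be verified on a small example. The sketch below assumes two hypothetical, well-separated blobs of points (`X_toy`) and recomputes the sum of squared distances to the nearest centroid by hand, then compares it with `KMeans.score`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two separated blobs in two dimensions
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(30, 2)),
                   rng.normal(8, 1, size=(30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Squared distance from every point to every centroid, keeping the nearest
sq_dist = ((X_toy[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_inertia = sq_dist.min(axis=1).sum()

# score() is the negative of that sum, so it is never positive
score = km.score(X_toy)
print(np.isclose(score, -manual_inertia))
```

Because the value grows with the number of rows and the scale of the features, it is only useful for comparing K-means fits on the same data, for example across different values of k.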


#### Prediction Errors
As in the linear regression section, we compute the MSE, MAE, and RMSE of the test predictions. <br><br>
For K-means, however, `predict` returns the index of the nearest cluster (0 or 1 with two clusters), not a price. The errors below therefore compare cluster labels with actual prices, so they are roughly the size of the prices themselves (an MAE of about P3,485,187) and do not measure predictive quality.

In [29]:
# Get multiple predictions
y_predictions= get_predictions(kmeans, X_test)

model_mse, model_mae, model_rmse = get_errors(y_test, y_predictions)

print("KMeans Prediction Errors: ")
print("Mean Squared Error {0}".format(model_mse))
print("Mean Absolute Error {0}".format(model_mae))
print("Root Mean Squared Error {0}".format(model_rmse))

KMeans Prediction Errors: 
Mean Squared Error 14860768886806.25
Mean Absolute Error 3485187.283333333
Root Mean Squared Error 3854966.7815438113


<br>

<br>

### 4. Evaluating the Model