# Addition of a Neural Network Model

After creating the previous three models, I decided to go back and look at the possibility of adding a fourth. I wanted to see whether a deep neural network could give a more realistic assessment of housing prices. I reached my goal of an RMSE under $200,000, but I believe I can do even better with a neural network. I am going to bring in my cleaned data so that I can go straight into the modeling process.

# Importing Libraries

In [1]:
import keras
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from keras.wrappers.scikit_learn import KerasRegressor
import pickle

# Bringing in the Cleaned Dataset

In [2]:
#bringing in the cleaned data
with open("Data/cleaned_data.pickle", 'rb') as infile:
    df = pickle.load(infile)

In [3]:
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,floors,waterfront,sqft_above,sqft_basement,sqft_living15,log_yard,...,zipcode_98146,zipcode_98148,zipcode_98155,zipcode_98166,zipcode_98168,zipcode_98177,zipcode_98178,zipcode_98188,zipcode_98198,zipcode_98199
0,365000.0,4.0,2.0,2070.0,2.0,0,2070.0,0,2390,8.969287,...,0,0,0,0,0,0,0,0,0,0
1,865000.0,5.0,3.0,2900.0,1.0,0,1830.0,1070,2370,8.25062,...,0,0,0,0,0,0,0,0,0,0
2,1038000.0,4.0,2.0,3770.0,2.0,0,3770.0,0,3710,9.105868,...,0,0,0,0,0,0,0,0,0,0
3,1490000.0,3.0,4.0,4560.0,2.0,0,4560.0,0,4050,9.419628,...,0,0,0,0,0,0,0,0,0,0
4,711000.0,3.0,2.0,2550.0,2.0,0,2550.0,0,2250,8.318986,...,0,0,0,0,0,0,0,0,0,0


# Preparing the Data

In [4]:
#dropping the dependent variable (the first column) to keep only the features
dvariables = df.iloc[:, 1:]

#this here will become the SHAPE variable in the function below
number_of_columns = len(dvariables.columns)

#isolating the target variable
target = df.iloc[:,0]

In [5]:
#implementing train test split
X_fulltrain, X_fulltest, y_fulltrain, y_fulltest = train_test_split(dvariables, target, random_state = 42, test_size = .2)

#splitting again for a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_fulltrain, y_fulltrain, random_state = 42, test_size = .2)
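
Since the two splits are chained, the final proportions are not obvious at a glance. As a rough sketch, assuming each held-out portion is rounded up to a whole row and using a hypothetical row count of 1000 (the real dataset size is not shown here), the arithmetic works out to roughly 64% train, 16% validation, and 20% test:

```python
import math


def split_sizes(n_rows, test_size=0.2, valid_size=0.2):
    """Return (train, valid, test) row counts for two chained splits."""
    # First split: hold out the test set from all rows.
    n_test = math.ceil(n_rows * test_size)
    n_fulltrain = n_rows - n_test
    # Second split: hold out the validation set from the remaining rows.
    n_valid = math.ceil(n_fulltrain * valid_size)
    n_train = n_fulltrain - n_valid
    return n_train, n_valid, n_test


# Hypothetical row count, chosen only to make the proportions easy to read.
n_train, n_valid, n_test = split_sizes(1000)
print(n_train, n_valid, n_test)  # 640 160 200
```

The helper `split_sizes` is illustrative only and is not part of the notebook's pipeline.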


In [6]:
#defining the metric for the model
def root_mean_squared_error(y_true, y_pred):
    """This is a helper function to allow root_mean_squared_error
    as the evaluation metric"""
    return keras.backend.sqrt(keras.backend.mean(keras.backend.square(y_pred - y_true))) 
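
To make the metric concrete, here is a minimal plain-Python sketch of the same calculation (square the errors, average them, take the square root). The prices are made up for illustration, and the helper name `rmse` is my own rather than anything from Keras:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices, for illustration only.
actual = [300000.0, 500000.0]
predicted = [310000.0, 430000.0]

# Errors are 10,000 and -70,000, so the mean squared error is 2.5e9.
print(rmse(actual, predicted))  # 50000.0
```

Note that the result (50,000) is larger than the mean absolute error of these two predictions (40,000), because squaring gives the larger miss more weight.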

# Model Creation

In [15]:
X_train.shape[1]

106

In [7]:
def create_model(num_layers = 1, shape = 106, optimizer = 'adam'):
    """This function takes in the number of hidden layers to add after the
    first Dense layer, the shape of the input, and the optimizer to create
    a compiled model ready to be fit."""
    model = keras.Sequential()
    model.add(keras.layers.Dense(shape, activation = 'relu', input_shape=(shape, )))
    for layer in range(num_layers):
        model.add(keras.layers.Dense(shape, activation = 'relu'))
    model.add(keras.layers.Dense(1))
    
    
    model.compile(optimizer, loss = root_mean_squared_error, metrics = [root_mean_squared_error])
    return model
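
One detail worth spelling out: the first `Dense` layer is itself a hidden layer, so `num_layers = 2` produces three hidden layers of width `shape` plus the single output unit. The sketch below mirrors that stacking in plain Python and counts the weights and biases, assuming fully connected layers with a bias term. The helper names are hypothetical and only for illustration:

```python
def dense_layer_widths(num_layers=1, shape=106):
    """List the Dense layer output widths that create_model stacks."""
    # One input-facing Dense layer, num_layers further hidden layers,
    # then a single output unit.
    return [shape] * (1 + num_layers) + [1]


def count_parameters(num_layers=1, shape=106):
    """Count weights and biases for the fully connected stack."""
    total = 0
    inputs = shape
    for width in dense_layer_widths(num_layers, shape):
        total += inputs * width + width  # weights plus biases
        inputs = width
    return total


print(dense_layer_widths(2))  # [106, 106, 106, 1]
print(count_parameters(2))    # 34133
print(count_parameters(8))    # 102185
```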

In [8]:
model_1 = create_model(2, number_of_columns, 'adam')

In [9]:
#fitting the model
history = model_1.fit(X_train, y_train, epochs = 30, validation_data = (X_valid, y_valid)) #creating a small validation set to assess overfitting

Epoch 1/30
...
Epoch 30/30


# Evaluation

In [10]:
print('Evaluation at Epoch 30:\n')
for key in history.history:
    print(key + ':' + str(history.history[key][-1]) + '\n')

Evaluation at Epoch 30:

loss:231196.359375

root_mean_squared_error:231147.609375

val_loss:240313.796875

val_root_mean_squared_error:239650.109375



The training and validation errors ended up close to one another (a final-epoch RMSE of about 231,148 on the training set versus about 239,650 on the validation set), which suggests the model is not badly overfitting. However, the model has not crossed the goal threshold of an RMSE under $200,000. For now, I am led to believe that the originally chosen model shows the most promise.
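
To put a number on how close the two sets are, the gap can be expressed as a fraction of the training error. This is a small sketch using the final-epoch values printed above. The helper name `relative_gap` is my own, and the goal value is the $200,000 target mentioned at the start of the notebook:

```python
def relative_gap(train_rmse, valid_rmse):
    """Validation RMSE minus training RMSE, as a fraction of training RMSE."""
    return (valid_rmse - train_rmse) / train_rmse


# Values copied from the final-epoch output above.
train_rmse = 231147.609375
valid_rmse = 239650.109375
GOAL_RMSE = 200_000  # target stated in the introduction

gap = relative_gap(train_rmse, valid_rmse)
print(f"{gap:.1%}")            # 3.7%
print(valid_rmse < GOAL_RMSE)  # False
```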

# Second Model Attempt

As an attempt at improvement, I am going to try another model with more layers to see how it compares to the first attempt.

In [11]:
model_2 = create_model(8, number_of_columns, 'adam')

In [12]:
history_2 = model_2.fit(X_train, y_train, epochs = 30, validation_data=(X_valid, y_valid))

Epoch 1/30
...
Epoch 30/30


In [13]:
print('Evaluation at Epoch 30:\n')
for key in history_2.history:
    print(key + ':' + str(history_2.history[key][-1]) + '\n')

Evaluation at Epoch 30:

loss:190927.140625

root_mean_squared_error:190863.578125

val_loss:189786.890625

val_root_mean_squared_error:189243.875



Both the training and validation errors dropped, and the validation RMSE (about 189,244) is now slightly lower than the training RMSE (about 190,864). This tells us that the model is not overfitting at this point and that more layers were needed to get a better estimate of the target. Overfitting is still worth watching for, since a single validation set is only one check. I think the next best step is to incorporate random search to find better hyperparameters.
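
As a rough idea of what random search does, the sketch below samples a few parameter combinations at random and keeps the one with the lowest score. It is a plain-Python illustration of the idea and not the `RandomizedSearchCV` API. The search space values are assumptions, and the scoring function is a toy stand-in with no relation to real results; in practice it would build a model with `create_model(**params)`, fit it, and return the validation RMSE.

```python
import random

# Hypothetical search space; the values are illustrative assumptions.
param_space = {
    "num_layers": [1, 2, 4, 8],
    "optimizer": ["adam", "rmsprop", "sgd"],
}


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random parameter combinations from the search space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


def random_search(space, score_fn, n_iter=5, seed=42):
    """Return the sampled combination with the lowest score, and that score."""
    candidates = sample_candidates(space, n_iter, seed)
    scored = [(score_fn(params), params) for params in candidates]
    best_score, best_params = min(scored, key=lambda pair: pair[0])
    return best_params, best_score


def toy_score(params):
    """Toy stand-in for 'train the model and return validation RMSE'."""
    return abs(params["num_layers"] - 4)


best_params, best_score = random_search(param_space, toy_score, n_iter=5)
print(best_params, best_score)
```

Random search only tries `n_iter` combinations rather than the full grid, which keeps the cost manageable when each candidate requires training a network.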