# Addition of a Neural Network Model

After building the previous three models, I decided to go back and consider adding a fourth. I wanted to see whether a deep neural network could give a more realistic assessment of housing prices. I had already reached my goal of an RMSE under $200,000, but I believe I can do even better with a neural network. I am going to bring in my cleaned data so that I can go straight into the modeling process.

# Importing Libraries

In [1]:
import keras
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from keras.wrappers.scikit_learn import KerasRegressor
import pickle

import warnings
warnings.filterwarnings('ignore')

# Bringing in the Cleaned Dataset

In [2]:
#bringing in the cleaned dataset
with open("Data/cleaned_data.pickle", 'rb') as infile:
    df = pickle.load(infile)

In [3]:
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,floors,waterfront,sqft_above,sqft_basement,sqft_living15,log_yard,...,zipcode_98146,zipcode_98148,zipcode_98155,zipcode_98166,zipcode_98168,zipcode_98177,zipcode_98178,zipcode_98188,zipcode_98198,zipcode_98199
0,365000.0,4.0,2.0,2070.0,2.0,0,2070.0,0,2390,8.969287,...,0,0,0,0,0,0,0,0,0,0
1,865000.0,5.0,3.0,2900.0,1.0,0,1830.0,1070,2370,8.25062,...,0,0,0,0,0,0,0,0,0,0
2,1038000.0,4.0,2.0,3770.0,2.0,0,3770.0,0,3710,9.105868,...,0,0,0,0,0,0,0,0,0,0
3,1490000.0,3.0,4.0,4560.0,2.0,0,4560.0,0,4050,9.419628,...,0,0,0,0,0,0,0,0,0,0
4,711000.0,3.0,2.0,2550.0,2.0,0,2550.0,0,2250,8.318986,...,0,0,0,0,0,0,0,0,0,0


# Preparing the Data

In [4]:
#dropping the dependent variable to keep only the features
dvariables = df.iloc[:, 1:]

#this is the value for the input_shape argument of the function below
number_of_columns = len(dvariables.columns)

#isolating the target variable
target = df.iloc[:,0]

In [5]:
#implementing train test split
X_fulltrain, X_fulltest, y_fulltrain, y_fulltest = train_test_split(dvariables, target, random_state = 42, test_size = .2)

#splitting again for a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_fulltrain, y_fulltrain, random_state = 42, test_size = .2)
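
Because the second split is taken from the 80% training portion, the three sets end up at roughly 64% / 16% / 20% of the data. A small sketch of that arithmetic follows. It uses a hypothetical row count of 1,000 (the real size of the dataset is not shown here), and the helper `split_sizes` and its round-up rule are assumptions made for illustration:

```python
def split_sizes(n_rows, test_pct=20):
    """Return (train, valid, test) row counts for two successive splits.

    Assumes each split rounds the held-out share up to a whole row;
    that rounding rule is an assumption, not taken from the notebook.
    """
    # integer ceiling of n_rows * test_pct / 100
    n_test = (n_rows * test_pct + 99) // 100
    n_fulltrain = n_rows - n_test
    # the validation set is carved out of the remaining training rows
    n_valid = (n_fulltrain * test_pct + 99) // 100
    n_train = n_fulltrain - n_valid
    return n_train, n_valid, n_test

# hypothetical dataset of 1,000 rows
train, valid, test = split_sizes(1000)
print(train, valid, test)
```

Only the 64% portion is used for fitting, so the model sees noticeably less data than the first split alone would suggest.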


# Model Creation

In [6]:
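def create_model(optimizer = 'adam', num_layers = 1, activation = 'relu', neurons = 50, drop_out= .2, input_shape = 106, learning_rate = .003):
    """Create and compile a Sequential regression model, ready to be fit in the next step.

    Takes an optimizer, the number of hidden layers (num_layers), their activation
    and width (neurons), the dropout rate (drop_out), and the number of input
    features (input_shape)."""
    #NOTE: learning_rate is accepted but never applied below; when the optimizer is
    #passed by name (as in the default), it is compiled with its default learning rate
    model = keras.Sequential()
    model.add(keras.layers.InputLayer(input_shape = (input_shape,)))
    #stacking num_layers hidden layers of the same width
    for layer in range(num_layers):
        model.add(keras.layers.Dense(neurons, activation = activation))
    #a single dropout layer after the last hidden layer, to help prevent overfitting
    model.add(keras.layers.Dropout(drop_out))
    model.add(keras.layers.Activation(activation))
    #single output neuron for the predicted price
    model.add(keras.layers.Dense(1))

    model.compile(optimizer, loss = 'mse', metrics = ['mse'])
    return model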
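
To get a feel for how large the networks built by `create_model` are, the trainable parameters can be counted by hand: a Dense layer with `n_in` inputs and `n_out` units has `(n_in + 1) * n_out` weights and biases, while the Dropout and Activation layers add none. A minimal sketch, assuming the default 106 input features and 50 neurons per layer (the helper name `count_parameters` is made up for illustration):

```python
def count_parameters(num_layers=1, neurons=50, input_shape=106):
    """Count weights and biases in the stack of Dense layers.

    Mirrors the layout of create_model: num_layers hidden Dense layers
    of equal width followed by a single-unit output layer.
    """
    total = 0
    n_in = input_shape
    for _ in range(num_layers):
        total += (n_in + 1) * neurons  # weights plus one bias per unit
        n_in = neurons
    total += (n_in + 1) * 1  # output layer
    return total

print(count_parameters(num_layers=2))  # 7951
print(count_parameters(num_layers=8))  # 23251
```

These figures should agree with the trainable parameter count that `model.summary()` reports for the same settings.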

In [7]:
model_1 = create_model(num_layers = 2, input_shape = number_of_columns)

In [8]:
#fitting the model
history = model_1.fit(X_train, y_train, epochs = 50, 
                      #utilizing our validation set to test/prevent overfitting
                      validation_data = (X_valid, y_valid)) 



Epoch 1/50
...
Epoch 50/50


# Evaluation

In [9]:
print('Evaluation at Epoch 50:\n')
for key in history.history:
    print(key + ':' + str(history.history[key][-1]) + '\n')

print('RMSE :' + str(np.sqrt(history.history['val_mse'][-1])))

Evaluation at Epoch 50:

loss:72900689920.0

mse:72900689920.0

val_loss:68826398720.0

val_mse:68826398720.0

RMSE :262347.85823406297


It is interesting to see that the training and validation errors have come very close to one another. In terms of the actual metric, however, the validation RMSE of about $262,348 is still above the goal of $200,000. For now I am led to believe that the originally chosen model shows the most promise. Nevertheless, I will continue on and try to lower the error by manually tinkering with the parameters.
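
The RMSE above is just the square root of the validation MSE, which puts the error back into dollars. As a quick check of that conversion against the $200,000 goal, using the value printed at epoch 50 (the name `GOAL_RMSE` is introduced here for illustration):

```python
import math

GOAL_RMSE = 200_000  # target error in dollars, from the goal stated earlier

val_mse = 68826398720.0  # validation MSE of the first model at epoch 50
val_rmse = math.sqrt(val_mse)

print(f"validation RMSE: {val_rmse:,.2f}")
print(f"distance from goal: {val_rmse - GOAL_RMSE:,.2f}")
print("goal met" if val_rmse < GOAL_RMSE else "goal not met")
```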

# Second Model Attempt

For an attempt at improvement, I am going to try another model with more layers to see how it compares to the first attempt.

In [10]:
#I am increasing the number of layers in this model
model_2 = create_model(num_layers = 8, input_shape = number_of_columns)

In [11]:
history_2 = model_2.fit(X_train, y_train, epochs = 50, validation_data=(X_valid, y_valid))

Epoch 1/50
...
Epoch 50/50


In [12]:
print('Evaluation at Epoch 50:\n')
for key in history_2.history:
    print(key + ':' + str(history_2.history[key][-1]) + '\n')
    
print('RMSE :' + str(np.sqrt(history_2.history['val_mse'][-1])))

Evaluation at Epoch 50:

loss:51909283840.0

mse:51909283840.0

val_loss:39161872384.0

val_mse:39161872384.0

RMSE :197893.58853687


The error here is reduced from the last attempt, which suggests that more layers may have been needed to get a better estimate of the target. The validation RMSE of about $197,894 comes in under the threshold and achieves our goal of less than $200,000. It still does not, however, match the original models. I think the next best step would be to incorporate a randomized search to find optimal parameters.
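
The idea behind a randomized search is to draw a fixed number of parameter combinations at random instead of trying every one. A rough standard-library sketch of that sampling step follows. The candidate values and the helper `sample_candidates` are hypothetical and only reuse argument names from `create_model`; the sketch does not train anything and is not a stand-in for `RandomizedSearchCV` itself:

```python
import random

# hypothetical search space; keys match create_model's arguments
param_space = {
    "num_layers": [2, 4, 8, 12],
    "neurons": [25, 50, 100],
    "drop_out": [0.1, 0.2, 0.3],
    "activation": ["relu", "tanh"],
}

def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random parameter combinations from the search space."""
    rng = random.Random(seed)  # seeded so the draw is repeatable
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]

candidates = sample_candidates(param_space, n_iter=5)
for params in candidates:
    print(params)
```

Each sampled dictionary could then be unpacked into `create_model(**params)`, fit, and scored on the validation set, keeping the combination with the lowest validation RMSE.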