## Artificial Neural Network - House Price Prediction

This notebook uses an artificial neural network to solve a binary classification problem: determining whether a house's price is above the median price or not.


In [1]:
# Dataset
# https://drive.google.com/file/d/1GfvKA0qznNVknghV4botnNxyH-KvODOC/view

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from tensorflow import keras
from sklearn.model_selection import train_test_split

In [3]:
houseprice_df = pd.read_csv('housepricedata.csv')

In [4]:
# check the dimensions of the dataset
houseprice_df.shape

(1460, 11)

In [5]:
# View first rows of dataset
houseprice_df.head()

Unnamed: 0,LotArea,OverallQual,OverallCond,TotalBsmtSF,FullBath,HalfBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,GarageArea,AboveMedianPrice
0,8450,7,5,856,2,1,3,8,0,548,1
1,9600,6,8,1262,2,0,3,6,1,460,1
2,11250,7,5,920,2,1,3,6,1,608,1
3,9550,7,5,756,1,0,3,7,1,642,0
4,14260,8,5,1145,2,1,4,9,1,836,1


The last column, 'AboveMedianPrice', indicates whether the house price is above the median price (1) or not (0).

In [6]:
# view stats of dataset
houseprice_df.describe()

Unnamed: 0,LotArea,OverallQual,OverallCond,TotalBsmtSF,FullBath,HalfBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,GarageArea,AboveMedianPrice
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,10516.828082,6.099315,5.575342,1057.429452,1.565068,0.382877,2.866438,6.517808,0.613014,472.980137,0.49863
std,9981.264932,1.382997,1.112799,438.705324,0.550916,0.502885,0.815778,1.625393,0.644666,213.804841,0.500169
min,1300.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
25%,7553.5,5.0,5.0,795.75,1.0,0.0,2.0,5.0,0.0,334.5,0.0
50%,9478.5,6.0,5.0,991.5,2.0,0.0,3.0,6.0,1.0,480.0,0.0
75%,11601.5,7.0,6.0,1298.25,2.0,1.0,3.0,7.0,1.0,576.0,1.0
max,215245.0,10.0,9.0,6110.0,3.0,2.0,8.0,14.0,3.0,1418.0,1.0
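The mean of `AboveMedianPrice` in the summary above (0.49863) is the fraction of rows labelled 1, so the two classes are close to evenly balanced. A minimal sketch of checking the balance directly is shown here; `sample_df` is a hypothetical stand-in built from the five preview rows (three columns only), not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for houseprice_df: the five rows shown by head(),
# restricted to three columns to keep the example short.
sample_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    "AboveMedianPrice": [1, 1, 1, 0, 1],
})

# Count how many rows fall into each class.
class_counts = sample_df["AboveMedianPrice"].value_counts()

# For a 0/1 label, the mean is the fraction of rows in class 1.
positive_fraction = sample_df["AboveMedianPrice"].mean()

print(class_counts)
print(positive_fraction)
```

On the real data the same two lines would be applied to `houseprice_df` instead of `sample_df`.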


### Separate the data into features and labels

In [7]:
# labels: the result we want to predict
# features: the variables we can use to classify the house price

In [8]:
# Define labels as (y) and features as (x)

y = houseprice_df['AboveMedianPrice']                      # Target variable (label)
x = houseprice_df.drop(columns=['AboveMedianPrice'])       # the remaining 10 variables (features)

In [9]:
# check the first 5 observations of the labels (y)
y[0:5]

0    1
1    1
2    1
3    0
4    1
Name: AboveMedianPrice, dtype: int64

We can see from the output above that each label is either 0 or 1.

In [10]:
# check the first 5 observations of all the features
x[0:5]

Unnamed: 0,LotArea,OverallQual,OverallCond,TotalBsmtSF,FullBath,HalfBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,GarageArea
0,8450,7,5,856,2,1,3,8,0,548
1,9600,6,8,1262,2,0,3,6,1,460
2,11250,7,5,920,2,1,3,6,1,608
3,9550,7,5,756,1,0,3,7,1,642
4,14260,8,5,1145,2,1,4,9,1,836


The output above shows the 10 feature columns (one variable per column).

As all of the data is already numerical, we don't need to convert them.
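The claim that every column is numerical can be checked rather than read off the preview. The following is a sketch under the assumption that a quick dtype and missing-value scan is enough; `sample_df` is again a hypothetical stand-in built from the preview rows.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for houseprice_df (subset of the preview rows and columns).
sample_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    "GarageArea": [548, 460, 608, 642, 836],
    "AboveMedianPrice": [1, 1, 1, 0, 1],
})

# Collect any column whose dtype is not numeric; an empty list means no conversion is needed.
non_numeric_columns = [
    column for column in sample_df.columns
    if not is_numeric_dtype(sample_df[column])
]

# Total number of missing values across the whole frame.
missing_values = int(sample_df.isna().sum().sum())

print(non_numeric_columns)
print(missing_values)
```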

In [11]:
# Convert the labels and features from pandas objects to NumPy float32 arrays

y = y.values.astype('float32')
x = x.values.astype('float32')

### Split data into Train and Test sets

In [12]:
# split the features and labels data (70/30)

# Test dataset:

y_train, y_test, x_train, x_test = train_test_split(y, x, test_size = 0.3)

In [13]:
# Validation dataset: hold out 30% of the remaining training data

y_train, y_validation, x_train, x_validation = train_test_split(y_train, x_train, test_size = 0.3)
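Because the second split takes 30% of the already reduced training set, the data ends up divided roughly 49% / 21% / 30% between train, validation and test. The two calls above are also unseeded, so each run produces different subsets. A sketch of a reproducible variant follows; `random_state=42` and `stratify` are additions that the notebook does not use, and the `*_demo` arrays are random stand-ins with the same shape as the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins with the same shape as the real features and labels.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1460, 10)).astype("float32")
y_demo = rng.integers(0, 2, size=1460).astype("float32")

# First split: hold out 30% as the test set.
# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Second split: hold out 30% of the remaining rows as the validation set.
x_tr, x_val, y_tr, y_val = train_test_split(
    x_tr, y_tr, test_size=0.3, random_state=42, stratify=y_tr
)

print(len(x_tr), len(x_val), len(x_te))
```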

## Create Model

In [14]:
# Using the Keras Sequential model.
# Choose the number of layers to add to the model, and the type of activation function for each layer.
model = keras.Sequential([keras.layers.Dense(32, input_shape = (10,)),      # First layer: 32 nodes, takes the 10 feature columns; no activation given, so it is linear
                          keras.layers.Dense(5, activation = tf.nn.relu),   # Hidden layer: 5 nodes, ReLU (rectified linear unit) activation
                          keras.layers.Dense(2, activation = 'softmax')])   # Output layer: 2 nodes (above median price or not), softmax activation

In [15]:
# Compile model
model.compile(optimizer = 'adam',                        # Adam optimiser
              loss = 'sparse_categorical_crossentropy',  # loss function: the quantity the optimiser minimises during training
              metrics = ['acc'])                         # report accuracy during training and evaluation
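To make the loss concrete, here is a small NumPy sketch of what softmax followed by sparse categorical cross-entropy computes. It is an illustration of the formula, not Keras's implementation; the function names and the toy `logits` and `labels` are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(labels, probabilities):
    # "Sparse" means labels are class indices (0 or 1), not one-hot vectors.
    rows = np.arange(len(labels))
    picked = probabilities[rows, labels.astype(int)]
    # Average negative log-probability assigned to the true class.
    return float(-np.log(picked).mean())

# Toy case: equal logits for both classes, i.e. the model has no preference.
logits = np.zeros((4, 2))
labels = np.array([1, 1, 0, 1], dtype="float32")

probabilities = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probabilities)
print(loss)  # ln(2), about 0.6931
```

A model that assigns probability 0.5 to each class regardless of the input therefore scores a loss of ln 2, which is a useful baseline when reading the loss values reported by `fit` and `evaluate`.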

In [16]:
# Fit data to model
history = model.fit(x_train, y_train, epochs=25, validation_data=(x_validation, y_validation))  # epochs: number of full passes over the training data

Epoch 1/25
...
Epoch 25/25


In [17]:
# Test model on TEST dataset:

prediction_features = model.predict(x_test)     # Apply the model to the TEST features (returns class probabilities)
performance = model.evaluate(x_test, y_test)    # Returns [loss, accuracy] on the TEST features and TEST labels
print(performance)

[0.6928657293319702, 0.5136986374855042]
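The two numbers printed are the test loss followed by the test accuracy. The array returned by `model.predict` holds one probability per class for each row, so the accuracy can also be recomputed by hand. The sketch below assumes that layout; `probabilities_demo` and `labels_demo` are made-up values, not actual model output.

```python
import numpy as np

# Hypothetical softmax outputs: one row per house, columns = P(class 0), P(class 1).
probabilities_demo = np.array([
    [0.20, 0.80],
    [0.60, 0.40],
    [0.45, 0.55],
    [0.10, 0.90],
])
labels_demo = np.array([1, 0, 1, 0], dtype="float32")

# The predicted class is the column with the highest probability.
predicted_classes = probabilities_demo.argmax(axis=1)

# Accuracy is the fraction of rows where prediction and label agree.
accuracy = float((predicted_classes == labels_demo).mean())
print(predicted_classes, accuracy)
```

With the real data, `prediction_features` would take the place of `probabilities_demo` and `y_test` the place of `labels_demo`.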


Looking at the results of testing the model on the TEST dataset, we get an accuracy of about 51.4%. Since the two classes are almost evenly balanced, this is barely better than random guessing.
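The summary statistics show features on very different scales: LotArea ranges from 1300 to 215245, while HalfBath ranges from 0 to 2. Unscaled inputs like these can make training harder, so one possible next step is to standardize the features before fitting. The notebook does not do this, and it has not been verified to help here. The sketch below assumes plain z-score scaling with statistics taken from the training rows only; the helper `standardize` and the toy arrays are hypothetical.

```python
import numpy as np

def standardize(train, *others):
    # Compute per-column mean and standard deviation from the training rows only,
    # so no information from validation or test rows leaks into the scaling.
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = np.where(std == 0, 1.0, std)
    return [(array - mean) / std for array in (train, *others)]

# Toy data: LotArea and OverallQual from the preview rows.
train_demo = np.array([[8450, 7], [9600, 6], [11250, 7], [9550, 7]], dtype="float64")
test_demo = np.array([[14260, 8]], dtype="float64")

train_scaled, test_scaled = standardize(train_demo, test_demo)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

After scaling, each training column has mean 0 and standard deviation 1. Whether this changes the test accuracy would need to be checked by retraining the model on the scaled arrays.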