# Machine Learning Regression to Predict House Prices using ANN 

## Introduction and Statement of the Problem

- The dataset contains house sale prices for King County, USA.
- It covers houses sold between May 2014 and May 2015.
- Data: https://www.kaggle.com/harlfoxem/housesalesprediction

- Interpretation of the Columns:
    - id: Unique identifier for a house
    - date: Date house was sold
    - price: Price is prediction target
    - bedrooms: Number of Bedrooms/House
    - bathrooms: Number of bathrooms/House
    - sqft_living: square footage of the home
    - sqft_lot: square footage of the lot
    - floors: Total floors (levels) in house
    - waterfront: House which has a view to a waterfront
    - view: Rating of the quality of the view from the house (0 to 4)
    - condition: How good the condition is ( Overall )
    - grade: overall grade given to the housing unit, based on King County grading system
    - sqft_above: square footage of house apart from basement
    - sqft_basement: square footage of the basement
    - yr_built: Built Year
    - yr_renovated: Year when house was renovated
    - zipcode: zip
    - lat: Latitude coordinate
    - long: Longitude coordinate
    - sqft_living15: Living room area in 2015 (implies some renovations)
    - sqft_lot15: Lot size area in 2015 (implies some renovations)

## Let's Start The Project by Loading the Libraries and the Dataset

In [None]:
#Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Importing the Dataset
house_data = pd.read_csv('../input/kc_house_data.csv')

In [None]:
#Let's see how many rows and columns of data are available in the dataset
house_data

The dataset has 21,613 rows and 21 columns.

In [None]:
#Looking at the top 10 rows of the dataset
house_data.head(10)

In [None]:
#Looking at the bottom 10 of the dataset
house_data.tail(10)

Let's take a look at the summary statistics of the data

In [None]:
house_data.describe()

By looking at the Summary statistics of the data, we can see the following:
 1. The most expensive house sold for 7,700,000 dollars and the least expensive for 75,000 dollars.
 2. The oldest house sold was built in 1900 and the newest in 2015.
 


Let's now check whether the data has any missing values.

In [None]:
house_data.info()
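As a more direct check than reading the non-null counts from `info()`, missing values can be counted per column. The sketch below uses a small made-up frame (`toy`) in place of `house_data`, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for house_data
toy = pd.DataFrame({
    "price": [221900.0, np.nan, 180000.0],
    "bedrooms": [3, 3, 2],
})

# Count missing entries per column; all zeros would mean the data is complete
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Running `house_data.isnull().sum()` in the notebook should show a zero for every column if the dataset is complete.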

The dataset is complete. Let's now go to Visualizing the data.

## Data Visualization

Data visualization is an important step in any machine learning project. Before training a model, you should understand the data you are working with: which features matter and what insights the data offers.

Here's the data again.

In [None]:
house_data

Let's take a look at the columns available.

In [None]:
house_data.columns

Looking at the relationship between the area of the house(sqft_living) and the price.

In [None]:
plt.figure(figsize=(20, 10))
sns.scatterplot(x = house_data['price'], y = house_data['sqft_living'], color = 'g')

Also for the relationship between the area of the lot(sqft_lot) and the price.

In [None]:
plt.figure(figsize=(20, 10))
sns.scatterplot(x = house_data['price'], y = house_data['sqft_lot'], color = 'g')

It can be seen that the bigger the house, the higher the price, which makes sense.
The area of the lot, on the other hand, does not seem to affect the price of the house much.

Let's take a look now at the distribution of all the data in each column.

In [None]:
house_data.hist(bins=20, figsize=(20, 20), color = 'g')

We can also see how the features correlate with each other using a heat map.

In [None]:
plt.figure(figsize=(20, 20))
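#numeric_only=True (pandas >= 1.5) skips the text 'date' column, which corr() cannot handle on pandas >= 2.0
sns.heatmap(house_data.corr(numeric_only=True), annot=True)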

To see if there are more insights we can extract from the data, we will plot it using pair plots.

In [None]:
sns.pairplot(house_data)

Because of the number of columns in the dataset, the plots become too small to read. We will only consider the features that look more "important" than the others.

In [None]:
house_data_important = house_data[ ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'yr_built']   ]

In [None]:
house_data_important

In [None]:
sns.pairplot(house_data_important)
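One way to make "important" less subjective is to rank the features by the absolute value of their correlation with `price`. This is only a rough sketch on a made-up frame (`toy`); correlation captures linear relationships only, and the notebook's own feature choice was made by eye.

```python
import pandas as pd

# Toy frame with made-up values; in the notebook this would be house_data
toy = pd.DataFrame({
    "price": [200000, 350000, 500000, 650000, 900000],
    "sqft_living": [900, 1500, 2100, 2600, 3800],
    "sqft_lot": [5000, 4000, 9000, 3000, 6000],
    "date": ["20141013T000000"] * 5,  # text column, ignored by numeric_only
})

# Correlation of every numeric feature with the target, strongest first
correlations = toy.corr(numeric_only=True)["price"].drop("price")
ranking = correlations.abs().sort_values(ascending=False)
print(ranking)
```

In this toy data `sqft_living` ranks first, which mirrors what the scatter plots above suggest for the real dataset.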

## Splitting and Scaling the Dataset

Before training the model, we first have to split the dataset into the features used to train the model and the output we want the model to predict. We will call the features X and the output y.

In [None]:
#We will use only the important features that we set in the Visualization part.
X = house_data_important.drop(['price'], axis =1)
y = house_data['price']

In [None]:
#Let's take a look at X
X.head(10)

In [None]:
X.shape

In [None]:
#and y
y.head(10)

In [None]:
y.shape

When working with neural networks, one of the most important steps before model training is data scaling. Neural networks generally train better when all input values are in the same range.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
X_scaled
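To make clear what `MinMaxScaler` does, the sketch below applies the min-max formula, (x - min) / (max - min) per column, by hand and compares it with scikit-learn's result. The array `X_toy` holds made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: two feature columns on very different scales
X_toy = np.array([[1.0, 500.0],
                  [3.0, 1500.0],
                  [5.0, 2500.0]])

# Min-max scaling by hand: (x - column_min) / (column_max - column_min)
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_toy)
print(X_manual)
```

Both columns end up in the range 0 to 1, regardless of their original units.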

In [None]:
# Scaling y
y = y.values.reshape(-1, 1)

In [None]:
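#Note: this refits the same `scaler` on y, replacing the ranges learned from X.
#From here on, `scaler` can only transform or inverse-transform prices.
y_scaled = scaler.fit_transform(y)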

In [None]:
y_scaled
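Since a single scaler object can only remember one set of ranges, an alternative is to keep one scaler for the features and one for the target. This is a sketch with made-up values; the names `x_scaler`, `y_scaler`, and `new_house` are hypothetical and not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up features (bedrooms, sqft_living) and prices
X_toy = np.array([[2.0, 1000.0], [3.0, 2000.0], [4.0, 4000.0]])
y_toy = np.array([[200000.0], [350000.0], [800000.0]])

# Separate scalers: one remembers the feature ranges, the other the price range
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
X_toy_scaled = x_scaler.fit_transform(X_toy)
y_toy_scaled = y_scaler.fit_transform(y_toy)

# A new, unseen house can still be scaled with the feature ranges
new_house = np.array([[3.0, 3000.0]])
new_house_scaled = x_scaler.transform(new_house)

# Scaled prices can be mapped back to dollars with the target scaler
y_recovered = y_scaler.inverse_transform(y_toy_scaled)
print(new_house_scaled, y_recovered.ravel())
```

With this layout, predicting the price of a new house later does not require refitting anything.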

## Splitting the Dataset to Training and Testing 

We will split the dataset into a training set, used to fit the model, and a testing set, used to measure the model's performance.

In [None]:
#We will be splitting the data into 75% for training and 25% for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size = 0.25)

In [None]:
print("We have", X_train.shape, "for training and", X_test.shape, "for testing.")
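As a quick check of the sizes to expect, the sketch below splits dummy arrays with the same shape as the scaled data (21,613 rows, 7 features). The `random_state=42` argument is an addition for reproducibility and is not part of the notebook's own call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays with the same shape as X_scaled and y_scaled
n_rows, n_features = 21613, 7
X_dummy = np.zeros((n_rows, n_features))
y_dummy = np.zeros((n_rows, 1))

# Same 75/25 split; random_state fixes the shuffle so reruns give the same rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)
```

scikit-learn rounds the test share up, so 25% of 21,613 rows becomes 5,404 test rows and 16,209 training rows.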

## Building and Training the Model

In this project we will use an Artificial Neural Network (ANN) to predict house prices from the features in the dataset, so that it can also be applied to new data.

In [None]:
#We will use keras in building our ANN Model. Let's start by importing the needed libraries
import keras
from keras.models import Sequential
from keras.layers import Dense

In [None]:
#Lets start building the model.
model = Sequential()
model.add(Dense(32, input_dim = 7, activation = 'relu'))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(1, activation = 'linear'))
model.compile(loss = 'mean_squared_error', optimizer = 'Adam')
model.summary()
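The parameter counts reported by `model.summary()` can be verified by hand: a `Dense` layer with bias has (inputs × units) weights plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this arithmetic sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# Input width followed by the width of each Dense layer in the model above
layer_sizes = [7, 32, 64, 128, 1]

per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(per_layer)
print(per_layer, total_params)
```

For this architecture the layers hold 256, 2,112, 8,320, and 129 parameters, or 10,817 in total.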

In [None]:
#We will train the model and store all the history of the training in the variable history_epoch
history_epoch = model.fit(X_train, y_train, batch_size = 32, epochs = 50, validation_split=0.2)
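With `validation_split=0.2`, Keras holds out the last 20% of the rows passed to `fit` for validation and trains on the rest. The arithmetic below is a sketch of the resulting sizes, assuming the 16,209 training rows produced by a 75/25 split of 21,613 rows.

```python
import math

n_train_rows = 16209       # assumed: 75% of the 21,613 rows
validation_split = 0.2
batch_size = 32

# Rows actually used for gradient updates, and rows held out for validation
n_fit = math.floor(n_train_rows * (1.0 - validation_split))
n_val = n_train_rows - n_fit

# Number of batches (weight updates) per epoch; the last batch may be smaller
steps_per_epoch = math.ceil(n_fit / batch_size)
print(n_fit, n_val, steps_per_epoch)
```

So each of the 50 epochs runs about 406 batches, and the validation loss is computed on roughly 3,242 rows the model never trains on.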

## Evaluating the Model

There are many ways to check the performance of a regression model. The most widely used metrics are the Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), R-Squared (R2), and Adjusted R-Squared.
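For reference, each of these metrics can be computed by hand with NumPy. The sketch below uses made-up values for `y_true` and `y_pred` and assumes a single feature (`k = 1`) for the adjusted R2.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])
n = len(y_true)  # number of samples
k = 1            # assumed number of features

errors = y_true - y_pred
mse = np.mean(errors ** 2)            # Mean Squared Error
rmse = np.sqrt(mse)                   # Root Mean Squared Error
mae = np.mean(np.abs(errors))         # Mean Absolute Error

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # R-Squared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # Adjusted R-Squared

print(mse, rmse, mae, r2, adj_r2)
```

RMSE and MAE are in the same units as the target (dollars here), while R2 and adjusted R2 are unitless, with values closer to 1.0 indicating a better fit.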

In [None]:
#First lets visualize how the loss in the training differs from the validation
history_epoch.history.keys()

In [None]:
plt.plot(history_epoch.history['loss'])
plt.plot(history_epoch.history['val_loss'])
plt.title('Model Loss During Training')
plt.xlabel('Epoch')
plt.ylabel('Training and Validation Loss')
plt.legend(['Training Loss', 'Validation Loss'])

Looking at the graph above, we can see that the model is not generalizing as well as we need it to.
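The loss curves can also be summarized numerically. The sketch below finds the epoch with the lowest validation loss and the final gap between validation and training loss. The helper `summarize_history` and the values in `toy_history` are hypothetical; in the notebook the dictionary would be `history_epoch.history`.

```python
def summarize_history(history):
    """Return (best epoch, counted from 1, and final val/train loss gap)."""
    val_loss = history["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    final_gap = val_loss[-1] - history["loss"][-1]
    return best_index + 1, final_gap

# Made-up losses: training keeps improving while validation turns upward
toy_history = {
    "loss": [0.010, 0.006, 0.004, 0.003, 0.002],
    "val_loss": [0.011, 0.007, 0.006, 0.007, 0.009],
}

best_epoch, final_gap = summarize_history(toy_history)
print(best_epoch, final_gap)
```

A validation loss that bottoms out early while the training loss keeps falling, as in this toy example, is a typical sign of overfitting.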

In [None]:
#Let's do the prediction for the test data
y_predictions = model.predict(X_test)

In [None]:
#Transform back y_predict and y_test into its original values.
orig_y_test = scaler.inverse_transform(y_test)
orig_y_predict = scaler.inverse_transform(y_predictions)

In [None]:
n = len(X_test)
k = X_test.shape[1]

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

MSE = mean_squared_error(orig_y_test, orig_y_predict)
RMSE = round(float(np.sqrt(MSE)), 3)
MAE = mean_absolute_error(orig_y_test, orig_y_predict)
r2 = r2_score(orig_y_test, orig_y_predict)
#Adjusted R2 penalizes R2 for the number of features k, given n test samples
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 


From the plot and the metrics, we can conclude that the model we have built does not perform well, but it can still be improved. Let's now move on to improving the model.

## Improving the Model

We will now look into some ways to improve the model. We will try adding more features and adjusting the model's parameters.

In [None]:
#We will be using more features (18 distinct columns)
more_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement', 'waterfront', 'view', 'condition', 'grade', 'yr_built',
'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

X_2 = house_data[more_features]

In [None]:
#Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_2_scaled = scaler.fit_transform(X_2)

In [None]:
y_2 = house_data['price']
y_2 = y_2.values.reshape(-1,1)
#Refitting `scaler` on y_2 replaces the ranges learned from X_2; from here on it only maps prices
y_2_scaled = scaler.fit_transform(y_2)

In [None]:
#Splitting the dataset into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_2_scaled, y_2_scaled, test_size = 0.25)

In [None]:
#Let's build the model again, this time with one more hidden layer.
model = Sequential()
#input_dim follows the number of feature columns in X_train
model.add(Dense(32, input_dim = X_train.shape[1], activation = 'relu'))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(1, activation = 'linear'))
model.compile(loss = 'mean_squared_error', optimizer = 'Adam')
model.summary()

In [None]:
history_epoch_2 = model.fit(X_train, y_train, batch_size = 32, epochs = 50, validation_split=0.2)

In [None]:
#Visualizing the losses
plt.plot(history_epoch_2.history['loss'])
plt.plot(history_epoch_2.history['val_loss'])
plt.title('Model Loss During Training')
plt.ylabel('Training and Validation Loss')
plt.xlabel('Epoch number')
plt.legend(['Training Loss', 'Validation Loss'])

In [None]:
#Let's do the prediction for the test data
y_predictions_2 = model.predict(X_test)

In [None]:
#Transform back y_predict and y_test into its original values.
orig_y_test = scaler.inverse_transform(y_test)
orig_y_predict_2 = scaler.inverse_transform(y_predictions_2)

In [None]:
n = len(X_test)
k = X_test.shape[1]

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

MSE = mean_squared_error(orig_y_test, orig_y_predict_2)
RMSE = round(float(np.sqrt(MSE)), 3)
MAE = mean_absolute_error(orig_y_test, orig_y_predict_2)
r2 = r2_score(orig_y_test, orig_y_predict_2)
#Adjusted R2 penalizes R2 for the number of features k, given n test samples
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 


Adding more features, together with the extra hidden layer, gives the model's performance a big boost. The adjusted R2 is closer to 1.0, which means that the model's predictions on the test set are closer to the actual values.