### House price predictions with neural networks

A small example of training a simple multi-layer neural network in Keras to predict house prices. The data is available on [Kaggle](https://www.kaggle.com/lodhaad/house-prices).

In [None]:
import pandas as pd
import numpy as np

In [None]:
# First, let's read our data into memory and view the top rows using the pandas head() function

data = pd.read_csv('home_data.csv')

data.head()

Understanding your data is one of the most important steps before tackling a data science problem. One of the easiest ways to look for initial relationships is to compute a correlation matrix, which can help us decide which columns are important and which are expendable. Remember, however, that a field with a low correlation in its current form may still be useful after some further preprocessing.

In [None]:
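# Restrict to numeric columns: recent pandas versions raise an error
# if non-numeric columns (such as a text date) are present
data.select_dtypes('number').corr()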

In [None]:
# Correlation matrix over the numeric columns only
corr_mat = data.select_dtypes('number').corr()
# Filter by price column and sort descending
corr_mat['price'].sort_values(ascending=False)
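The point about low correlations can be illustrated with a small sketch. The data below is a hypothetical toy example, not taken from the house price dataset: the target depends on the feature only through its square, so the raw Pearson correlation is essentially zero, while the correlation after a simple transformation is perfect.

```python
import numpy as np

# Hypothetical toy data: the target depends on x only through x squared.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# Pearson correlation of the raw feature with the target.
raw_corr = np.corrcoef(x, y)[0, 1]

# Pearson correlation after squaring the feature.
squared_corr = np.corrcoef(x ** 2, y)[0, 1]

print(f"raw feature:     {raw_corr:.3f}")
print(f"squared feature: {squared_corr:.3f}")
```

A column that looks expendable in the correlation matrix may therefore still carry signal once it has been transformed.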

### Data cleaning and preprocessing

The first thing we need to do before we are ready to train a neural network is prepare our data. First we will split our labels from the main dataset and remove any unwanted fields that may confuse the model.

In [None]:
labels = data[['price']]
features = data.drop(['id', 'date', 'price', 'zipcode', 'yr_built', 'condition','yr_renovated', 'lat', 'long', 'sqft_lot15'], axis=1)

print(features.shape, labels.shape)

Scikit-learn, one of the largest Python machine learning libraries, and Keras are both designed to work with pandas DataFrames and NumPy arrays. Functions from both libraries can therefore be used together.

Here we use the scikit-learn preprocessing module to scale our input data. This is important because large numbers can be problematic for neural networks. To account for this we use a StandardScaler, which standardises features by removing the mean and scaling to unit variance.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

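# Split first so the scaler only learns statistics from the training rows
X_train, X_test, y_train, y_test = train_test_split(features.values, labels.values, test_size=0.1, random_state=42)

# Fit the scaler on the training data, then apply the same transform to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)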

print('Train Size', X_train.shape, y_train.shape)
print('Test Size', X_test.shape, y_test.shape)
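As a rough illustration of what standardisation computes, here is a minimal sketch in plain NumPy rather than scikit-learn. The feature values are hypothetical, and taking the mean and standard deviation from the training rows only is one common convention, assumed here for the example.

```python
import numpy as np

# Hypothetical feature matrix: rows are houses, columns are
# (living area in square feet, number of bedrooms).
X_train_raw = np.array([[1000.0, 2.0],
                        [1500.0, 3.0],
                        [2000.0, 4.0],
                        [2500.0, 3.0]])
X_test_raw = np.array([[1750.0, 3.0]])

# Per-column mean and standard deviation, taken from the training rows.
mean = X_train_raw.mean(axis=0)
std = X_train_raw.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same statistics to both splits.
X_train_std = (X_train_raw - mean) / std
X_test_std = (X_test_raw - mean) / std

print(X_train_std.mean(axis=0))  # close to zero for every column
print(X_train_std.std(axis=0))   # close to one for every column
print(X_test_std)
```

After this step, square footage and bedroom counts sit on a comparable scale, even though their raw values differ by orders of magnitude.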

### Creating our model

Now we need to define our model architecture and hyperparameters. The options here define not only the shape of your network but how it learns. This is where we can easily experiment with all the complex underlying mathematical principles behind neural networks.

At the core of Keras is the **Sequential** model. Put simply, a sequential model is a stack of layers in which the output of one layer becomes the input of the next. The most important building block here is the **Dense** layer, which multiplies its inputs by a weight matrix, adds a bias and then applies the activation function, if one is given.

In [None]:
from keras.models import Sequential
from keras import optimizers
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(8, input_dim=X_train.shape[1], kernel_initializer="normal", activation='relu'))
model.add(Dense(4, kernel_initializer="normal", activation='relu'))
model.add(Dense(4, kernel_initializer="normal", activation='relu'))
model.add(Dense(8, kernel_initializer="normal", activation='relu'))
model.add(Dense(1))

model.summary()
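To make the Dense layer less abstract, the following is a minimal NumPy sketch of the forward pass only. It is not how Keras is implemented internally; the helper names `relu` and `dense`, the layer sizes and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(z, 0.0)

def dense(x, weights, bias, activation=None):
    """Multiply inputs by the weight matrix, add the bias, apply the activation."""
    z = x @ weights + bias
    return activation(z) if activation is not None else z

rng = np.random.default_rng(0)
n_features = 3                         # hypothetical number of input columns
x = rng.normal(size=(2, n_features))   # a batch of two hypothetical inputs

# One hidden layer with 8 units, then a single linear output unit.
W1, b1 = rng.normal(size=(n_features, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

hidden = dense(x, W1, b1, activation=relu)
output = dense(hidden, W2, b2)

# Trainable parameters: one weight per input-unit pair plus one bias per unit.
n_params = W1.size + b1.size + W2.size + b2.size

print(hidden.shape, output.shape, n_params)
```

The parameter count follows the same rule that `model.summary()` reports for each Dense layer: inputs times units, plus one bias per unit.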

Next we set our model hyper-parameters. The key parameters to decide here are the [loss function](https://keras.io/losses/), [optimiser](https://keras.io/optimizers/), [epoch and batch](https://keras.io/getting-started/faq/#what-does-sample-batch-epoch-mean). Understanding each of these and experimenting with different combinations is the key to a successful model.

In [None]:
# Set learning rate
lr = 0.3

# Set optimiser (recent Keras versions name this argument learning_rate; lr is deprecated)
opt = optimizers.Adam(learning_rate=lr)

# Compile model
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mae'])

# Set to variable if you want to store training statistics
history = model.fit(X_train, y_train, epochs=20, batch_size=32)
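The relationship between epochs, batch size and the number of weight updates can be worked out with simple arithmetic. The sketch below uses a hypothetical training set of 1000 rows; `updates_per_run` is an illustrative helper, not a Keras function.

```python
import math

def updates_per_run(n_samples, batch_size, epochs):
    """Return (batches per epoch, total weight updates) for a training run."""
    # The final batch of an epoch may be smaller than batch_size, hence the ceiling.
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Hypothetical training set size, with the batch size and epochs used above.
steps, total = updates_per_run(n_samples=1000, batch_size=32, epochs=20)
print(steps, total)
```

Halving the batch size roughly doubles the number of updates per epoch, which is one reason batch size and learning rate are usually tuned together.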

We can import a plotting library to visualise statistics from our model training. This can be very useful for determining whether a model is still improving, has already converged or is over-fitting.

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

print(history.history.keys())

plt.figure()
plt.plot(history.history['loss'])
plt.show()
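Spotting over-fitting usually needs a validation loss alongside the training loss. Keras records one under `val_loss` in `history.history` when `validation_split` or `validation_data` is passed to `fit`. The sketch below uses made-up loss values, and `best_epoch` is a hypothetical helper, to show one simple way of reading such curves.

```python
def best_epoch(val_losses):
    """Return the zero-based index of the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# Hypothetical loss curves: training loss keeps falling while
# validation loss bottoms out and then starts rising again.
train_loss = [9.0, 6.0, 4.0, 3.0, 2.5, 2.2, 2.0]
val_loss = [9.5, 6.5, 4.8, 4.5, 4.7, 5.0, 5.4]

best = best_epoch(val_loss)
epochs_without_improvement = len(val_loss) - 1 - best

print("best epoch:", best)
print("epochs since best:", epochs_without_improvement)
```

A training loss that keeps falling while the validation loss has not improved for several epochs is a common sign of over-fitting.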

### Evaluating model performance

Once you have trained your model, its performance needs to be evaluated. The easiest way to do this is to run the model on the entire test dataset that we set aside earlier, then compare each prediction with the actual value. Here scikit-learn's `mean_absolute_error` does that comparison for us and averages the absolute differences.

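**Note:** If you scale the labels as well, remember to apply the inverse of the scaler to the predictions to bring the numbers back to their original scale. In this notebook only the features were scaled, so the predictions are already in the original price units.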
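If the labels are standardised as well as the features, predictions come out in standardised units and need mapping back. A minimal sketch, assuming hypothetical prices and plain NumPy in place of a fitted scaler (scikit-learn's StandardScaler offers `inverse_transform` for the same purpose):

```python
import numpy as np

# Hypothetical house prices in dollars.
prices = np.array([[200000.0], [350000.0], [500000.0]])

# Forward transform: remove the mean and scale to unit variance.
mean, std = prices.mean(), prices.std()
scaled = (prices - mean) / std

# Inverse transform: undo the scaling to recover dollar values.
restored = scaled * std + mean

print(scaled.ravel())
print(restored.ravel())
```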

In [None]:
from sklearn.metrics import mean_absolute_error

predictions = model.predict(X_test)

mae = mean_absolute_error(y_test, predictions)

print("Mean absolute error: $%.2f" % mae)
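The comparison of each prediction with its actual value can also be written out as an explicit loop. This is a sketch with hypothetical prices, and `mean_absolute_error_loop` is an illustrative name rather than a library function.

```python
def mean_absolute_error_loop(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    total = 0.0
    for a, p in zip(actual, predicted):
        total += abs(a - p)
    return total / len(actual)

# Hypothetical actual and predicted prices in dollars.
actual = [200000.0, 350000.0, 500000.0]
predicted = [210000.0, 340000.0, 530000.0]

loop_mae = mean_absolute_error_loop(actual, predicted)
print("Mean absolute error: $%.2f" % loop_mae)
```

Because the errors are averaged in the original units, the result reads directly as a typical dollar amount by which a prediction misses.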

### Linear regression

Both Keras and scikit-learn are designed to take NumPy arrays and pandas DataFrames as inputs. We can therefore easily pass our training data into a range of scikit-learn regression models, such as [Linear](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) or [Support Vector Machine](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) regression.

In [None]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(X_train, y_train)

In [None]:
# Predict on the held-out test set
l_predictions = regr.predict(X_test)

# Average absolute difference between actual and predicted prices
l_mae = mean_absolute_error(y_test, l_predictions)

print("Mean absolute error: $%.2f" % l_mae)
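For intuition about what a linear regression fits, here is a minimal ordinary least squares sketch in NumPy. The data is a hypothetical noise-free line, and `np.linalg.lstsq` stands in for the scikit-learn estimator, so treat it as an illustration of the idea rather than of scikit-learn's internals.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 1.0

# Append a column of ones so the intercept is learned as an extra weight.
X_design = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve for the weights that minimise the squared error.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
slope, intercept = coef

print("slope: %.2f, intercept: %.2f" % (slope, intercept))
```

A linear model like this makes a useful baseline: if the neural network cannot beat it on the test set, the extra complexity is not yet paying for itself.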