# King County Housing Dataset

## Using Keras to create a model of the data

In general, to make a model very accurate we would need more data: perhaps photographs of each property, floor plans, and more historical data about each property's sale records. What other data sources can you think of that might improve our ability to predict property prices?

In [6]:
#imports occurring as they are needed
import pandas as pd
pd.options.display.max_columns = None

#load the data as always with pandas
df = pd.read_csv('kc_house_data.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


## fields

* date - Date house was sold
* price - Price is prediction target
* bedrooms - Number of Bedrooms/House
* bathrooms - Number of bathrooms/House
* sqft_living - Square footage of the home
* sqft_lot - square footage of the lot
* floors - Total floors (levels) in house
* waterfront - Whether the house has a view of a waterfront
* view - Has been viewed
* condition - How good the overall condition is
* grade - overall grade given to the housing unit, based on King County grading system
* sqft_above - square footage of house apart from basement
* sqft_basement - square footage of the basement
* yr_built - Built Year
* yr_renovated - Year when house was renovated
* zipcode - zip
* lat - Latitude coordinate
* long - Longitude coordinate
* sqft_living15 - Living room area in 2015 (implies some renovations); this might or might not have affected the lot size
* sqft_lot15 - Lot size in 2015 (implies some renovations)

In [7]:
df.dtypes

id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

In [8]:
#sample the date column to see the date format
df['date'].sample(10)

19712    20141223T000000
14279    20140929T000000
4501     20150404T000000
16004    20140512T000000
5447     20140806T000000
20676    20140515T000000
19621    20141223T000000
3860     20150422T000000
13060    20150429T000000
12467    20140823T000000
Name: date, dtype: object

In [9]:
#split the date out into different properties
df["date"] = pd.to_datetime(df["date"])
df["date_year"] = df["date"].dt.year
df["date_month"] = df["date"].dt.month
df['date_quarter'] = df['date'].dt.quarter
df["date_day"] = df["date"].dt.day
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,date_year,date_month,date_quarter,date_day
0,7129300520,2014-10-13,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650,2014,10,4,13
1,6414100192,2014-12-09,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639,2014,12,4,9
2,5631500400,2015-02-25,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062,2015,2,1,25
3,2487200875,2014-12-09,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000,2014,12,4,9
4,1954400510,2015-02-18,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503,2015,2,1,18
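
The `date` strings sampled earlier follow a `YYYYMMDDTHHMMSS` pattern. As a minimal sketch using only the standard library (the helper name `split_sale_date` is hypothetical and not part of the notebook), the same year, month, quarter and day values can be derived from a single raw string:

```python
from datetime import datetime

def split_sale_date(raw):
    """Parse a raw 'YYYYMMDDTHHMMSS' string into the four derived date features."""
    parsed = datetime.strptime(raw, "%Y%m%dT%H%M%S")
    # Quarter is derived from the month: months 1-3 -> 1, 4-6 -> 2, and so on.
    quarter = (parsed.month - 1) // 3 + 1
    return {"date_year": parsed.year, "date_month": parsed.month,
            "date_quarter": quarter, "date_day": parsed.day}

print(split_sale_date("20141013T000000"))
```

For the first row of the dataset this gives year 2014, month 10, quarter 4 and day 13, matching the table above.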


In [10]:
#some columns are not useful (see field values)
df.drop(columns=['id', 'date', 'view', 'yr_renovated'], inplace=True)

In [11]:
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,grade,sqft_above,sqft_basement,yr_built,zipcode,lat,long,sqft_living15,sqft_lot15,date_year,date_month,date_quarter,date_day
0,221900.0,3,1.0,1180,5650,1.0,0,3,7,1180,0,1955,98178,47.5112,-122.257,1340,5650,2014,10,4,13
1,538000.0,3,2.25,2570,7242,2.0,0,3,7,2170,400,1951,98125,47.721,-122.319,1690,7639,2014,12,4,9
2,180000.0,2,1.0,770,10000,1.0,0,3,6,770,0,1933,98028,47.7379,-122.233,2720,8062,2015,2,1,25
3,604000.0,4,3.0,1960,5000,1.0,0,5,7,1050,910,1965,98136,47.5208,-122.393,1360,5000,2014,12,4,9
4,510000.0,3,2.0,1680,8080,1.0,0,3,8,1680,0,1987,98074,47.6168,-122.045,1800,7503,2015,2,1,18


In [12]:
X = df.drop(['price'], axis=1).values
Y = df['price'].values

In [13]:
from sklearn.model_selection import train_test_split

#split the data and scale training separately from testing, otherwise test data will leak into the training data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

#create a separation between training and val/testing
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X, Y, test_size=0.3)

#now we can scale

X_train = sc.fit_transform(X_train)
X_val_and_test = sc.transform(X_val_and_test)


#split the test and validation data 
# (validation is used after each epoch to test the accuracy of that epoch. Test is used after training)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)
print(X_train.shape, X_val.shape, X_test.shape, Y_train.shape, Y_val.shape, Y_test.shape)

(15129, 20) (3242, 20) (3242, 20) (15129,) (3242,) (3242,)
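
To see why the scaler is fitted on the training rows only, here is a small numpy sketch on made-up numbers (the arrays are illustrative, not taken from the dataset). It standardizes the held-out rows with the training mean and standard deviation, which is what `fit_transform` followed by `transform` does in the cell above:

```python
import numpy as np

# Made-up feature column: six training rows and two held-out rows.
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
held_out = np.array([[10.0], [20.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mean) / std
held_out_scaled = (held_out - mean) / std

print(train_scaled.mean(), train_scaled.std())  # close to 0 and 1
print(held_out_scaled.ravel())  # not centred on 0: these rows never influenced mean or std
```

If the mean and standard deviation were computed over all rows instead, information about the held-out rows would be baked into the training features.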


## Linear regression 

To get a benchmark for the model, we first fit a linear regression.

In [14]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [15]:
y_pred = regressor.predict(X_test)
print('Linear Regression R squared: %.4f' % regressor.score(X_test, Y_test))

Linear Regression R squared: 0.6875


In [16]:
import numpy as np
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(Y_test, y_pred)
lin_rmse = np.sqrt(lin_mse)
print('Linear Regression RMSE: %.4f' % lin_rmse)

Linear Regression RMSE: 200648.9752


In [17]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(Y_test, y_pred)
print('Linear Regression MAE: %.4f' % lin_mae)

Linear Regression MAE: 128794.2980
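
For reference, the three benchmark metrics can be written out by hand. This is a minimal numpy sketch on toy values (not the housing data) of the standard definitions of R squared, RMSE and MAE:

```python
import numpy as np

# Toy targets and predictions, purely illustrative.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))                   # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))            # root mean squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r_squared = 1.0 - ss_res / ss_tot

print('MAE: %.4f, RMSE: %.4f, R squared: %.4f' % (mae, rmse, r_squared))
```

RMSE squares each error before averaging, so a few large misses raise it more than they raise MAE. RMSE is therefore never smaller than MAE on the same predictions.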


In [18]:
import keras

from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, Activation #activation might be better split out
from keras.optimizers import Adam, RMSprop
from keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint

from keras import metrics

from keras.utils import plot_model
from keras.models import load_model

from keras import optimizers
import time

Using TensorFlow backend.


## Notice

Note that the plot shows validation error as less than training error, which is quite deceptive. The reason is that the training error is averaged over every batch in the epoch (and at the beginning of the epoch it was much worse than at the end), whereas the validation error is computed once at the end of the epoch, after the model has improved.
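
A toy illustration of this effect, with made-up per-batch losses rather than values from this run: the training figure reported for an epoch is an average over its batches, while the validation figure reflects the weights reached by the end of the epoch.

```python
# Made-up losses for the five batches of one epoch; the model improves as it trains.
batch_losses = [10.0, 8.0, 6.0, 4.0, 2.0]

# The reported training loss is the average over every batch in the epoch.
reported_train_loss = sum(batch_losses) / len(batch_losses)

# Assumption for illustration: validation loss matches the level reached by the last batch.
end_of_epoch_loss = batch_losses[-1]

print(reported_train_loss, end_of_epoch_loss)
```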

In [99]:
#set the learning rate
lr = 0.3
opt = optimizers.Adam(lr=lr)

#note: using tanh for every layer failed to train

In [100]:
now = time.strftime("run_%c") #record the current time

#network callbacks.
tensorboard = TensorBoard(log_dir='./tensorboards/'+now + "_" + str(lr), histogram_freq=0, write_graph=True)
#earlyStopping = EarlyStopping(monitor='val_mean_absolute_error', patience=20, verbose=1)
checkpointSave = ModelCheckpoint(filepath='./best_model_'+now+'.h5', monitor='val_mean_absolute_error', save_best_only=True)

n_cols = X_train.shape[1]

#NOTE: simpler models may work and improve on the linear regression; the 2/3 rule is only a rule of thumb,
#and I have improved models without following it
#It makes sense as there are large gaps in the data (we don't have data for every possible type of property)
t_model = Sequential()

#1 exercise - remove layers
#2 exercise - change the number of neurons in layers
#3 exercise - change/remove the dropout layers and see the effect on overfitting
#4 exercise - change the learning rate

t_model.add(Dense(12, activation="relu", kernel_initializer='normal', input_shape=(n_cols,)))
#t_model.add(Dropout(0.2)) #exercise, get the room to define different drop outs perhaps?

t_model.add(Dense(4, activation="relu", kernel_initializer='normal')) 
#t_model.add(Dropout(0.1))

#t_model.add(Dense(12, activation="relu", kernel_initializer='normal'))
#t_model.add(Dropout(0.1))

t_model.add(Dense(4, activation="relu", kernel_initializer='normal'))
#t_model.add(Dropout(0.1))

t_model.add(Dense(12, activation="relu", kernel_initializer='normal'))

t_model.add(Dense(1))

t_model.compile(
    loss='mean_squared_error',
    optimizer=opt,
    metrics=[metrics.mae])
t_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_159 (Dense)            (None, 12)                252       
_________________________________________________________________
dense_160 (Dense)            (None, 4)                 52        
_________________________________________________________________
dense_161 (Dense)            (None, 4)                 20        
_________________________________________________________________
dense_162 (Dense)            (None, 12)                60        
_________________________________________________________________
dense_163 (Dense)            (None, 1)                 13        
Total params: 397
Trainable params: 397
Non-trainable params: 0
_________________________________________________________________
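
The parameter counts in the summary can be checked by hand: a `Dense` layer with a bias has `(inputs + 1) * units` weights. A short sketch, using the 20 input columns and the layer widths from the model above:

```python
# Layer widths: 20 input features, then the Dense layers 12 -> 4 -> 4 -> 12 -> 1.
layer_sizes = [20, 12, 4, 4, 12, 1]

# Each Dense layer has one weight per (input, unit) pair plus one bias per unit.
param_counts = [(n_in + 1) * n_out
                for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(param_counts)       # [252, 52, 20, 60, 13]
print(sum(param_counts))  # 397
```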


In [101]:
t_model.fit(X_train, Y_train,
    batch_size=25,
    epochs=100,
    verbose=1, # set to 2 for one line of output per epoch
    validation_data=(X_val, Y_val),
    callbacks=[tensorboard, #earlyStopping
              ])

Train on 15129 samples, validate on 3242 samples
Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x1393c4cc0>

In [None]:
tensorboard

In [85]:
train_score = t_model.evaluate(X_train, Y_train, verbose=0)
valid_score = t_model.evaluate(X_val, Y_val, verbose=0)
test_score = t_model.evaluate(X_test, Y_test, verbose=0)

print(train_score)
print(valid_score)
print(test_score)
print('Train MAE: ', round(train_score[1], 4), ', Train Loss: ', round(train_score[0], 4)) 
print('Val MAE: ', round(valid_score[1], 4), ', Val Loss: ', round(valid_score[0], 4))
print('Test MAE: ', round(test_score[1], 4), ', Test Loss: ', round(test_score[0], 4))

[139895116821.0499, 231751.6209514839]
[117636728967.18568, 230717.96310148056]
[128859685020.66379, 229010.96377621838]
Train MAE:  231751.621 , Train Loss:  139895116821.0499
Val MAE:  230717.9631 , Val Loss:  117636728967.1857
Test MAE:  229010.9638 , Test Loss:  128859685020.6638
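
Because the loss is mean squared error, its square root is an RMSE in the same units as `price`, which makes it easier to set beside the linear regression RMSE printed earlier. A small sketch using the test loss and benchmark values shown in the outputs above:

```python
import math

# Values copied from the outputs above.
test_loss_mse = 128859685020.6638  # Keras test loss (mean squared error)
lin_rmse = 200648.9752             # linear regression benchmark RMSE

test_rmse = math.sqrt(test_loss_mse)
print('Network test RMSE: %.2f' % test_rmse)
print('Beats the benchmark:', test_rmse < lin_rmse)
```

With these particular outputs the network is still behind the benchmark, which is a useful starting point for the exercises on layers, dropout and learning rate.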


# things for them to learn 

* different activation functions
* explanation of how the activation function can model a non linear function
* https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
* https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
* understand dropout, and regularization as techniques
* understand that more complex models are used for image recognition
* look at a model for image recognition
* understand what a convolution is
* understand that there are many tools and techniques for neural networks, and that gaining experience with them takes time, resources and learning
* need to watch the string of videos
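
To make the point about non-linearity concrete, here is a minimal standard-library sketch (scalar weights chosen arbitrarily for illustration). Two stacked linear layers collapse into a single linear function, while putting a ReLU between them produces a kinked curve that no single straight line can match:

```python
def relu(x):
    return max(0.0, x)

# Arbitrary illustrative weights and biases for two one-unit layers.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

def collapsed_linear(x):
    # Algebraically the same function as two_linear_layers.
    return (w2 * w1) * x + (w2 * b1 + b2)

def with_relu(x):
    return w2 * relu(w1 * x + b1) + b2

for x in (-1.0, 0.0, 1.0, 2.0):
    print(x, two_linear_layers(x), collapsed_linear(x), with_relu(x))
```

The ReLU version is flat wherever the first layer's output is negative and sloped elsewhere, which is the kind of piecewise behaviour a purely linear stack cannot represent.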


## exercise

* add in dropout (different values for each network)
* add in regularization to the second layer

* https://ml4a.github.io/ml4a/neural_networks/
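
As a warm-up for the exercise, the sketch below shows in plain numpy what the two techniques do. It illustrates the ideas (inverted dropout and an L2 weight penalty) and is not Keras's internal implementation; the function names and the values `rate=0.5` and `lam=0.01` are assumptions chosen for the example.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random subset of units and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

def l2_penalty(weights, lam):
    """L2 regularization term added to the loss: lam * sum of squared weights."""
    return lam * np.sum(weights ** 2)

rng = np.random.default_rng(0)
activations = np.ones(10)

dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)

weights = np.array([0.5, -1.0, 2.0])
print(l2_penalty(weights, lam=0.01))
```

In Keras the corresponding building blocks are the `Dropout` layer (already imported above) and the `kernel_regularizer` argument of `Dense`.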