# Assignment Instructions

This assignment asks you to train a neural network to predict housing prices.  I provide you with two housing datasets.  The first includes house prices and is used for training.  The second does not include house prices; you will predict prices for it and submit those predictions for evaluation.  I also give you a third dataset that contains the median income for each zip code.  It must be joined to both the training and test datasets to provide an additional input value.  You must include the median income among your inputs for extra predictive power.

You can find all of the needed CSV files here:

* [House Prices - Training](https://data.heatonresearch.com/data/t81-558/datasets/houses_train.csv)
* [House Prices - Submit](https://data.heatonresearch.com/data/t81-558/datasets/houses_test.csv)
* [Median Income by Zipcode](https://data.heatonresearch.com/data/t81-558/datasets/zips.csv)

The median income by zipcode file provides an additional feature, median income, that you should use in your predictions.  To complete this assignment, perform the following steps:

* Load the housing prices training data.
* Join the median income by zipcode to the training data so that each house gains a median income value.
* Train a model to predict house price when given the following inputs: 'bedrooms', 'bathrooms', 'garage', 'land', 'sqft', 'median_income'.
* Load the housing prices test data.  This data does not contain the house price; you must predict it.
* Join the median income by zipcode to the test/submit data so that each house gains a median income value.
* Predict prices for the test/submit data.  
* Create a submission dataset that contains the house id (from the test/submit data) and the predicted price for that house.  Include no other fields.
* Submit this dataset and see how close you are to the actual values.

Predicting the house prices to within +/- $10,000 is sufficient to complete the assignment.  You may also wish to see whether you can make your predictions even more accurate.
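
One way to check a set of predictions against this tolerance is sketched below. The assignment text does not say how the +/- $10,000 figure is measured, so treating it as a root-mean-square error (RMSE) is an assumption here. The helper name `rmse` and the sample prices are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical prices, not taken from the assignment data.
actual = np.array([250_000, 410_000, 320_000, 515_000])
predicted = np.array([246_000, 418_000, 323_000, 509_000])

error = rmse(actual, predicted)
# Share of houses whose individual prediction is within $10,000 of the actual price.
within = float(np.mean(np.abs(actual - predicted) <= 10_000))

print(f"RMSE: {error:,.0f}")
print(f"Share within $10,000: {within:.0%}")
print("Meets the tolerance (assumed to be RMSE)?", error < 10_000)
```

RMSE penalizes large misses more heavily than small ones, so a model can have most predictions within $10,000 and still exceed the threshold if a few houses are badly mispriced.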

In [14]:
# libraries

import io
import os

import numpy as np
import pandas as pd
import requests
from scipy.stats import zscore
from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.models import Sequential

In [15]:
# loading data sets

df_houses_train = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/houses_train.csv")
df_houses_submit = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/houses_test.csv")
df_zips = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/zips.csv")

# Index the zip table by zip code

df_zips = df_zips.set_index('zip')


In [16]:
# Join the median income by zip code onto both data frames (train and submit)

# df_zips is indexed by zip, so Series.map looks up each house's zip directly.
# A house whose zip is missing from df_zips gets a median income of 0.

df_houses_submit['median_income'] = df_houses_submit['zip'].map(df_zips['median_income']).fillna(0)
df_houses_train['median_income'] = df_houses_train['zip'].map(df_zips['median_income']).fillna(0)
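
The same join can also be expressed with `DataFrame.merge`, which makes it easier to spot houses whose zip code has no income record. The following is a minimal sketch on made-up miniature tables; the frames `houses` and `zips` and their values are hypothetical, not the assignment data.

```python
import pandas as pd

# Hypothetical miniature versions of the houses and zips tables.
houses = pd.DataFrame({
    "id": [1, 2, 3],
    "zip": [10001, 10002, 99999],
    "sqft": [900, 1500, 1200],
})
zips = pd.DataFrame({"zip": [10001, 10002], "median_income": [52000, 61000]})

# A left join keeps every house. validate raises an error if a zip appears
# more than once in zips, and indicator adds a `_merge` column per row.
merged = houses.merge(zips, on="zip", how="left",
                      validate="many_to_one", indicator=True)

# Rows marked 'left_only' had no matching zip, so their median_income is NaN.
unmatched = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print("Houses without a median income:", unmatched)

merged = merged.drop(columns="_merge")
print(merged)
```

In this notebook `df_zips` is indexed by zip rather than carrying it as a column, so the equivalent call would pass `left_on='zip', right_index=True` instead of `on='zip'`.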
    

In [24]:
# Train a model to predict house price when given the following inputs: 
# 'bedrooms', 'bathrooms', 'garage', 'land', 'sqft', 'median_income'.

y = df_houses_train.price.values
x = df_houses_train.loc[:, df_houses_train.columns.isin(['bedrooms', 'bathrooms', 'garage', 'land', 'sqft', 'median_income'])].values
y.shape


(10000,)

In [None]:
# Neural network predicting house prices, scored with 5-fold cross-validation


# Cross-Validate
kf = KFold(5, shuffle=True, random_state=42) # 5 shuffled folds for this regression task
    
oos_y = []
oos_pred = []

fold = 0
for train, test in kf.split(x):
    fold+=1
    print(f"Fold #{fold}")
        
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]
    
    model = Sequential()
    model.add(Dense(70, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(40, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    
    model.fit(x_train,y_train,validation_data=(x_test,y_test),verbose=0,epochs=1500)
    
    pred = model.predict(x_test)
    
    oos_y.append(y_test)
    oos_pred.append(pred)    

    # Measure this fold's RMSE
    score = np.sqrt(metrics.mean_squared_error(y_test,pred))
    print(f"Fold score (RMSE): {score}")

# Build the out-of-sample (oos) prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_y,oos_pred))
print(f"Final, out of sample score (RMSE): {score}")    
    
# Collect the true prices and the cross-validated predictions side by side
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat( [oos_y, oos_pred],axis=1 )
    



Fold #1
Fold score (RMSE): 836.5685829034901
Fold #2
Fold score (RMSE): 754.3782909331413
Fold #3
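
The final score in the cell above pools every fold's held-out predictions before taking the RMSE. That is not the same number as the average of the per-fold RMSE values whenever the folds differ in error. A minimal sketch with made-up residuals (the arrays below are hypothetical, not results from this notebook):

```python
import numpy as np

# Hypothetical out-of-sample errors (prediction minus actual) for two equal-sized folds.
fold_errors = [np.array([300.0, -400.0]), np.array([1200.0, -500.0])]

# RMSE of each fold on its own.
fold_rmse = [float(np.sqrt(np.mean(e ** 2))) for e in fold_errors]
mean_fold_rmse = float(np.mean(fold_rmse))

# Pooled RMSE: concatenate all held-out errors first, then take the RMSE.
pooled_rmse = float(np.sqrt(np.mean(np.concatenate(fold_errors) ** 2)))

print("Per-fold RMSE:", fold_rmse)
print("Mean of fold RMSEs:", mean_fold_rmse)
print("Pooled RMSE:", pooled_rmse)
```

With equal-sized folds the pooled value is never smaller than the mean of the fold values, so a single poorly fitted fold shows up more strongly in the pooled score.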


In [87]:
# Predict prices for the houses whose prices are unknown (the 'submit' dataset)
# using 'bedrooms', 'bathrooms', 'garage', 'land', 'sqft', 'median_income'.
# Note: `model` is the network left over from the last cross-validation fold,
# so it was trained on four of the five folds rather than on all training rows.

x_submit = df_houses_submit.loc[:, df_houses_submit.columns.isin(['bedrooms', 'bathrooms', 'garage', 'land', 'sqft', 'median_income'])].values
y_submit = model.predict(x_submit)
ident = df_houses_submit.id.values

In [95]:
# Create the submit dataframe containing only the house id (from the test/submit data)
# and the predicted price for that house.

submit = pd.DataFrame()
submit['Id'] = ident
submit['Predited_price'] = y_submit.flatten()  # predict returns shape (n, 1)
submit


Unnamed: 0,Id,Predited_price
0,10001,1.104945e+06
1,10002,8.126088e+05
2,10003,2.885584e+05
3,10004,2.345250e+05
4,10005,3.605716e+05
...,...,...
1995,11996,2.747598e+05
1996,11997,1.149881e+06
1997,11998,1.095591e+06
1998,11999,5.129116e+05
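
To hand in these predictions, the data frame still has to be written out with only the id and price fields. The following is a minimal sketch, assuming a CSV file is the expected submission format (the assignment text above does not say). The file name `submit.csv` is hypothetical and the values are made up; the column names follow the cell above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and model output; a Keras-style predict returns shape (n, 1).
ident = np.array([10001, 10002, 10003])
y_submit = np.array([[310000.0], [275500.0], [498250.0]])

submit = pd.DataFrame({"Id": ident, "Predited_price": y_submit.flatten()})

# index=False keeps the row index out of the file, so only the two fields are written.
buffer = io.StringIO()
submit.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Writing to disk would use a path instead of the buffer, for example:
# submit.to_csv("submit.csv", index=False)
```

Leaving out `index=False` would add an unnamed leading column holding the row index, which conflicts with the requirement to include no other fields.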
