
# Assignment 5
### Problem Statement

The median-income-by-zipcode file provides an additional feature, median income, that you should use in your predictions. To complete this assignment, perform the following steps:

1. Load the housing prices training data.

2. Join the median income by zipcode to the training data so that each house gains a median income value.

3. Train a model to predict house price from the following inputs: 'bedrooms', 'bathrooms', 'garage', 'land', 'sqft', 'median_income'.

4. Load the housing prices test data. This data does not contain the house price; you must predict it.

5. Join the median income by zipcode to the test/submit data to gain the median income.

6. Predict prices for the test/submit data.

7. Create a submission dataset that contains the house id (from the test/submit data) and the predicted price for that house. Include no other fields.

8. Submit this dataset and see how close you are to the actual values.

https://github.com/jeffheaton/t81_558_deep_learning/blob/df29ce2413c1ef32acaf99764c54b1b529cd8779/assignments/assignment_yourname_class5.ipynb


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import KFold

from scipy.stats import zscore 

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

**Save and Load Model** 

https://www.tensorflow.org/tutorials/keras/save_and_load


## Read, validate, join & scrub data

In [None]:
train_path = "https://data.heatonresearch.com/data/t81-558/datasets/houses_train.csv"
test_path = "https://data.heatonresearch.com/data/t81-558/datasets/houses_test.csv"
zip_path = "https://data.heatonresearch.com/data/t81-558/datasets/zips.csv"


In [None]:
df_train = pd.read_csv(train_path, na_values=["NA", "?"])
df_submit = pd.read_csv(test_path, na_values= ["NA", "?"])
df_zip = pd.read_csv(zip_path, na_values=["NA", "?"])

In [None]:
print(df_train.shape, df_submit.shape, df_zip.shape)

(10000, 8) (2000, 7) (50, 2)


In [None]:
df_train.head()

Unnamed: 0,id,zip,bedrooms,bathrooms,garage,land,sqft,price
0,1,60019,9,2,3,2.198,4860,1005580
1,2,60049,5,2,2,4.517,2870,620278
2,3,60011,2,1,0,4.12,1220,265711
3,4,60027,6,4,2,3.201,3810,819916
4,5,60001,9,3,2,1.347,5061,1039491


In [None]:
df_train.describe()

Unnamed: 0,id,zip,bedrooms,bathrooms,garage,land,sqft,price
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,60024.311,4.9988,2.3032,1.4467,2.586615,2961.4875,622229.4
std,2886.89568,14.505796,2.577918,1.623679,0.953336,1.366607,1550.644456,319141.6
min,1.0,60000.0,1.0,1.0,0.0,0.25,650.0,118364.0
25%,2500.75,60012.0,3.0,1.0,1.0,1.409,1676.0,354009.2
50%,5000.5,60024.0,5.0,2.0,2.0,2.5745,2899.5,613495.5
75%,7500.25,60037.0,7.0,3.0,2.0,3.771,4340.25,905456.5
max,10000.0,60049.0,9.0,7.0,3.0,4.999,5952.0,1270773.0


In [None]:
df_submit.describe()

Unnamed: 0,id,zip,bedrooms,bathrooms,garage,land,sqft
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,11000.5,60024.7895,5.0435,2.308,1.4455,2.653507,2983.922
std,577.494589,14.617798,2.634403,1.640269,0.970825,1.355154,1581.491528
min,10001.0,60000.0,1.0,1.0,0.0,0.26,650.0
25%,10500.75,60012.0,3.0,1.0,1.0,1.47725,1668.0
50%,11000.5,60025.0,5.0,2.0,2.0,2.678,2890.0
75%,11500.25,60038.0,7.0,3.0,2.0,3.798,4388.25
max,12000.0,60049.0,9.0,7.0,3.0,4.999,5952.0


In [None]:
# Check for missing values
sum(df_train.isnull().sum())

0

In [None]:
df_zip.head()

Unnamed: 0,zip,median_income
0,60000,75806
1,60001,205564
2,60002,307019
3,60003,145929
4,60004,135496


Join the median income by zipcode onto the train and submit data

In [None]:
df_train = df_train.join(df_zip.set_index('zip'), on='zip')

In [None]:
df_submit = df_submit.join(df_zip.set_index('zip'), on='zip')
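
A join like the ones above silently leaves `NaN` in `median_income` for any zipcode that is missing from the lookup table. The following is a minimal sketch of one way to check for that, assuming toy frames named `houses` and `zips` (hypothetical stand-ins for `df_train` and `df_zip`, with made-up values that are not taken from the dataset). It uses `DataFrame.merge` rather than the `join` call used in the notebook.

```python
import pandas as pd

# Toy stand-ins for df_train and df_zip; all values are made up for illustration.
houses = pd.DataFrame({
    "id": [1, 2, 3],
    "zip": [60001, 60002, 60001],
    "sqft": [1500, 2400, 3100],
})
zips = pd.DataFrame({
    "zip": [60001, 60002],
    "median_income": [50000, 80000],
})

# how="left" keeps every house; validate="many_to_one" raises if a zipcode
# appears more than once in the lookup table.
joined = houses.merge(zips, on="zip", how="left", validate="many_to_one")

# Count houses whose zipcode had no match in the lookup table.
n_missing = int(joined["median_income"].isna().sum())
print(f"rows: {len(joined)}, rows without median_income: {n_missing}")
```

If `n_missing` is nonzero, those rows would need to be imputed or dropped before training.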

### Set-up data for the model 

In [None]:
# Replace sqft and median_income with their z-scores.
# 'land' is left unscaled: its mean and standard deviation are already single-digit.
df_train['sqft'] = zscore(df_train['sqft'])
df_train['median_income'] = zscore(df_train['median_income'])

# Note: the submit data is standardized with its own mean and standard deviation,
# not with the training statistics.
df_submit['sqft'] = zscore(df_submit['sqft'])
df_submit['median_income'] = zscore(df_submit['median_income'])
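
The cell above standardizes `df_submit` with its own mean and standard deviation. A common alternative is to reuse the training statistics, so that the same raw value maps to the same scaled value in both sets. The sketch below illustrates this on toy frames (`train` and `submit` are hypothetical names, and the numbers are made up); it is not what the notebook does.

```python
import pandas as pd

# Toy data, made up for illustration only.
train = pd.DataFrame({"sqft": [1000.0, 2000.0, 3000.0]})
submit = pd.DataFrame({"sqft": [2000.0, 4000.0]})

# Fit the scaling on the training column only.
# ddof=0 matches the default of scipy.stats.zscore.
mean = train["sqft"].mean()
std = train["sqft"].std(ddof=0)

# Apply the same transformation to both sets.
train_scaled = (train["sqft"] - mean) / std
submit_scaled = (submit["sqft"] - mean) / std

print(train_scaled.round(3).tolist())
print(submit_scaled.round(3).tolist())
```

With this approach a 2000 sqft house scales to 0 in both frames, which would not generally hold if each frame were z-scored on its own.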

In [None]:
df_train.head()

Unnamed: 0,id,zip,bedrooms,bathrooms,garage,land,sqft,price,median_income
0,1,60019,9,2,3,2.198,1.224399,1005580,-0.820375
1,2,60049,5,2,2,4.517,-0.059003,620278,0.095794
2,3,60011,2,1,0,4.12,-1.12313,265711,-0.117203
3,4,60027,6,4,2,3.201,0.547227,819916,1.599117
4,5,60001,9,3,2,1.347,1.354029,1039491,-0.163032


In [None]:
# Feature columns: everything except the identifiers and the target
x_columns = df_train.columns.drop(['id','zip', 'price'])

In [None]:
# Convert to NumPy arrays
x = df_train[x_columns].values
y = df_train['price'].values

In [None]:
print(x, y)

[[ 9.          2.          3.          2.198       1.22439894 -0.82037489]
 [ 5.          2.          2.          4.517      -0.05900261  0.09579356]
 [ 2.          1.          0.          4.12       -1.12312953 -0.11720343]
 ...
 [ 7.          2.          2.          2.011       0.63364727 -0.65511305]
 [ 9.          7.          3.          1.768       1.88286778 -0.71704911]
 [ 4.          1.          1.          2.377      -0.49045771 -1.52139269]] [1005580  620278  265711 ...  813447 1208882  446179]


### Create Model 

In [None]:
# Create a folder to save the model checkpoints
save_dir = "model_checkpoints"
!mkdir $save_dir

In [None]:
def create_model(input_dim, output_dim = 1):
    # Build the model
    model = Sequential()
    model.add(Dense(100, input_dim = input_dim, activation='relu', kernel_initializer='random_normal'))   # Hidden 1
    #model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu', kernel_initializer = 'random_normal' ))                               # Hidden 2
    #model.add(Dropout(0.5))
    model.add(Dense(25, activation='relu', kernel_initializer= 'random_normal'))                                # Hidden 3
    model.add(Dense(output_dim))

    model.compile(loss = 'mean_squared_error', optimizer='adam')
    return model


In [None]:
# Use K-fold cross-validation (this is a regression problem)
kf = KFold(5 , shuffle=True, random_state=42)

# out of sample list (oos)
oos_y = []
oos_pred = []

fold = 0 

for train, test in kf.split(x):
    fold += 1
    print(f"Fold #{fold}")
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]


    # Build Model 
    model = create_model(x_train.shape[1])

    monitor = EarlyStopping(monitor='val_loss', min_delta = 1e-3, 
                            patience=5, verbose = 1, mode = 'auto', 
                            restore_best_weights=True)
    
    # Save the best weights for this fold as a checkpoint
    checkpoint = ModelCheckpoint(
        f"{save_dir}/{str(fold)}/",
        save_weights_only = True,
        monitor ="val_loss",
        verbose =0,
        save_best_only =True,
        mode = 'min'
    )
    
    # The held-out fold doubles as the early-stopping validation set
    model.fit(x_train, y_train, validation_data=(x_test, y_test),
              verbose=0, callbacks=[monitor, checkpoint], epochs=1000)
    # Predict on the held-out fold
    pred = model.predict(x_test)

    oos_y.append(y_test)
    oos_pred.append(pred)

    # Measure this fold's MSE and RMSE
    score = metrics.mean_squared_error(y_test, pred)
    rscore = np.sqrt(score)
    print("Fold Score (MSE) : {}".format(score))
    print(f"Fold Score (RMSE): {rscore}")


Fold #1
Restoring model weights from the end of the best epoch.
Epoch 00052: early stopping
Fold Score (MSE) : 23795437.651808105
Fold Score (RMSE): 4878.0567495477235
Fold #2
Restoring model weights from the end of the best epoch.
Epoch 00043: early stopping
Fold Score (MSE) : 24417560.216883942
Fold Score (RMSE): 4941.412775399758
Fold #3
Restoring model weights from the end of the best epoch.
Epoch 00048: early stopping
Fold Score (MSE) : 24285554.71957126
Fold Score (RMSE): 4928.037613449319
Fold #4
Restoring model weights from the end of the best epoch.
Epoch 00068: early stopping
Fold Score (MSE) : 22712087.881425537
Fold Score (RMSE): 4765.720080053542
Fold #5
Restoring model weights from the end of the best epoch.
Epoch 00045: early stopping
Fold Score (MSE) : 23859105.285986327
Fold Score (RMSE): 4884.57831199238


In [None]:
# Build the out-of-sample (oos) prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_y, oos_pred))
print(f"Final out of sample score (RMSE): {score}")

# write cross-validation prediction


Final out of sample score (RMSE): 4878.6087667627735


### File for Submission 

In [None]:
# File to submit 
df_final = df_submit.copy()

In [None]:
df_submit.head(2)

Unnamed: 0,id,zip,bedrooms,bathrooms,garage,land,sqft,median_income
0,10001,60027,8,6,2,2.901,1.42501,1.607912
1,10002,60026,7,2,2,2.455,0.597736,-0.783911


In [None]:
x_submit = df_submit[x_columns].values

In [None]:
x_columns

Index(['bedrooms', 'bathrooms', 'garage', 'land', 'sqft', 'median_income'], dtype='object')

In [None]:
x_submit.shape

(2000, 6)

In [None]:
x_submit[0:2]

array([[ 8.        ,  6.        ,  2.        ,  2.901     ,  1.42501015,
         1.60791168],
       [ 7.        ,  2.        ,  2.        ,  2.455     ,  0.59773596,
        -0.78391149]])

In [None]:
# Load each fold's saved weights and predict on the submit data
model = create_model(x_submit.shape[1])

for fold in range(1, 6):
    fold_str = str(fold)
    model.load_weights(f"{save_dir}/{fold_str}/")
    # model.predict returns shape (n, 1); flatten it to a 1-D column
    df_final[f"pred_{fold_str}"] = model.predict(x_submit).flatten()

In [None]:
df_final.head(2)

Unnamed: 0,id,zip,bedrooms,bathrooms,garage,land,sqft,median_income,pred_1,pred_2,pred_3,pred_4,pred_5
0,10001,60027,8,6,2,2.901,1.42501,1.607912,1096276.0,1096121.125,1096746.125,1096356.0,1096251.875
1,10002,60026,7,2,2,2.455,0.597736,-0.783911,805389.8,805895.5,805652.875,805645.2,805296.875


In [None]:
# Average the five fold predictions into the final price
preds = []

for fold in range(1, 6):
    preds.append(df_final[f"pred_{fold}"])

df_final['price'] = pd.concat(preds, axis=1).mean(axis=1).astype(int)
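
The averaging step can be checked by hand on the two rows displayed earlier. The sketch below copies the five per-fold predictions for houses 10001 and 10002 as they were printed (so they carry display rounding) into a NumPy array named `fold_preds`, a name introduced here for illustration. It also shows that `astype(int)` truncates toward zero rather than rounding.

```python
import numpy as np

# Per-fold predictions for the first two submit rows, as displayed above.
fold_preds = np.array([
    [1096276.0, 1096121.125, 1096746.125, 1096356.0, 1096251.875],
    [805389.8, 805895.5, 805652.875, 805645.2, 805296.875],
])

# Ensemble prediction: the mean across the five folds for each house.
mean_price = fold_preds.mean(axis=1)

# astype(int) drops the fractional part; np.rint rounds to the nearest integer.
truncated = mean_price.astype(int)
rounded = np.rint(mean_price).astype(int)

print(truncated.tolist())  # [1096350, 805576]
print(rounded.tolist())    # [1096350, 805576]
```

For these two rows truncation and rounding agree, but they can differ by one whenever the fractional part of the mean is 0.5 or more.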

In [None]:
df_final.head(2)

Unnamed: 0,id,zip,bedrooms,bathrooms,garage,land,sqft,median_income,pred_1,pred_2,pred_3,pred_4,pred_5,price
0,10001,60027,8,6,2,2.901,1.42501,1.607912,1096276.0,1096121.125,1096746.125,1096356.0,1096251.875,1096350
1,10002,60026,7,2,2,2.455,0.597736,-0.783911,805389.8,805895.5,805652.875,805645.2,805296.875,805576


In [None]:
df_kaggle = df_final[['id', 'price']]

In [None]:
df_kaggle

Unnamed: 0,id,price
0,10001,1096350
1,10002,805576
2,10003,288309
3,10004,236756
4,10005,360988
...,...,...
1995,11996,265756
1996,11997,1154614
1997,11998,1098463
1998,11999,503582
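
The notebook stops at the `df_kaggle` frame and does not show how the submission is written out. A minimal sketch of that step follows, assuming a CSV submission with only the `id` and `price` columns. The frame `submission` is a two-row stand-in for `df_kaggle`, and the in-memory buffer replaces a real file path, which would be an assumption in any case.

```python
import io
import pandas as pd

# Two-row stand-in for df_kaggle, using the first two rows shown above.
submission = pd.DataFrame({"id": [10001, 10002], "price": [1096350, 805576]})

# index=False keeps the DataFrame index out of the file, so the output
# contains only the id and price columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, a path such as `"submission.csv"` (a hypothetical name) could be passed to `to_csv` in place of the buffer.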

