
In [0]:
!pip install tf-nightly-gpu-2.0-preview

## Ames Housing Price Regression with a Wide and Deep Model

This example includes:

- Parsing and vectorization of a structured dataset containing both numerical features and categorical features.
- Handling missing values.
- Data normalization.
- Building and training a wide-and-deep model for regression.
- Decoding model predictions.

## Download and unzip the data

In [0]:
!wget https://s3.amazonaws.com/keras-datasets/ames-housing-dataset.zip

--2019-06-03 18:25:55--  https://s3.amazonaws.com/keras-datasets/ames-housing-dataset.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.21.181
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.21.181|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 182871 (179K) [application/zip]
Saving to: ‘ames-housing-dataset.zip’


2019-06-03 18:25:56 (2.07 MB/s) - ‘ames-housing-dataset.zip’ saved [182871/182871]



In [0]:
!unzip ames-housing-dataset.zip

Archive:  ames-housing-dataset.zip
  inflating: AmesHousing.csv         


## Load the data in Pandas and list the columns

In [0]:
import pandas as pd
import numpy as np

dataframe = pd.read_csv('AmesHousing.csv')

In [0]:
print('Columns:', list(enumerate(dataframe.columns)))

Columns: [(0, 'Order'), (1, 'PID'), (2, 'MS SubClass'), (3, 'MS Zoning'), (4, 'Lot Frontage'), (5, 'Lot Area'), (6, 'Street'), (7, 'Alley'), (8, 'Lot Shape'), (9, 'Land Contour'), (10, 'Utilities'), (11, 'Lot Config'), (12, 'Land Slope'), (13, 'Neighborhood'), (14, 'Condition 1'), (15, 'Condition 2'), (16, 'Bldg Type'), (17, 'House Style'), (18, 'Overall Qual'), (19, 'Overall Cond'), (20, 'Year Built'), (21, 'Year Remod/Add'), (22, 'Roof Style'), (23, 'Roof Matl'), (24, 'Exterior 1st'), (25, 'Exterior 2nd'), (26, 'Mas Vnr Type'), (27, 'Mas Vnr Area'), (28, 'Exter Qual'), (29, 'Exter Cond'), (30, 'Foundation'), (31, 'Bsmt Qual'), (32, 'Bsmt Cond'), (33, 'Bsmt Exposure'), (34, 'BsmtFin Type 1'), (35, 'BsmtFin SF 1'), (36, 'BsmtFin Type 2'), (37, 'BsmtFin SF 2'), (38, 'Bsmt Unf SF'), (39, 'Total Bsmt SF'), (40, 'Heating'), (41, 'Heating QC'), (42, 'Central Air'), (43, 'Electrical'), (44, '1st Flr SF'), (45, '2nd Flr SF'), (46, 'Low Qual Fin SF'), (47, 'Gr Liv Area'), (48, 'Bsmt Full Bath'

Columns "Order" and "PID" are not features and need to be filtered out. Column "SalePrice" contains the target we need to predict for each sample.

## Vectorize targets

In [0]:
targets = dataframe['SalePrice'].values.astype('float32')
print('targets.shape:', targets.shape)

targets.shape: (2930,)


## Vectorize features

There are two kinds of features in this dataset:

- Numerical features: a number associated with each sample. For instance, "Lot Area" is a numerical feature.
- Categorical features: a string value associated with each sample. For instance, "MS Zoning" is a categorical feature.

In [0]:
dataframe['Lot Area'][:5]

0    31770
1    11622
2    14267
3    11160
4    13830
Name: Lot Area, dtype: int64

In [0]:
dataframe['MS Zoning'][:5]

0    RL
1    RH
2    RL
3    RL
4    RL
Name: MS Zoning, dtype: object

We vectorize each numerical feature by using pandas' `values` attribute and casting the resulting NumPy array to float32.

We vectorize each categorical feature by:

- Making a list of the different possible values taken by this feature, and how many times each value occurs.
- Filtering out values with fewer than 50 occurrences.
- One-hot encoding the remaining values.

One-hot encoding is done by assigning an index to each value, then encoding the feature for each sample as an array that is all zeros except at the index of the value taken by that sample.

For instance, given 3 possible values, "val1", "val2", and "val3", we could assign the following indices:

`{'val1': 0, 'val2': 1, 'val3': 2}`

then a sample with value "val2" would be encoded as:

`[0, 1, 0]`

and a sample with value "val3" would be encoded as:

`[0, 0, 1]`

If a sample has an empty entry for this feature, or has a value that was filtered out (i.e. not one of `{'val1', 'val2', 'val3'}`), we would encode it as:

`[0, 0, 0]`.
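
The encoding described above can be sketched on a handful of values. This is a minimal illustration rather than the notebook's own function: `one_hot` is a hypothetical helper, `'val4'` stands in for a value that was filtered out, and `None` stands in for an empty entry.

```python
import numpy as np

# Index assignment taken from the example above.
value_index = {'val1': 0, 'val2': 1, 'val3': 2}


def one_hot(values, value_index):
    """Encode each value as a row that is all zeros except at its index."""
    array = np.zeros((len(values), len(value_index)), dtype='float32')
    for i, value in enumerate(values):
        # Missing or filtered-out values are not in the index,
        # so their rows stay all-zero.
        if value in value_index:
            array[i, value_index[value]] = 1.
    return array


# 'val4' plays the role of a filtered-out value, None of an empty entry.
encoded = one_hot(['val2', 'val3', 'val4', None], value_index)
print(encoded)
```

The first two rows match the `[0, 1, 0]` and `[0, 0, 1]` encodings given above, and the last two rows are all zeros.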

In [0]:
SKIP_VALUES_WITH_FEWER_OCCURENCES_THAN = 50

def vectorize_column(column):
  print('Processing column: "%s"' % column.name)
  dtype = column.dtype
  if np.issubdtype(dtype, np.number):
    print('...numerical column')
    array = vectorize_numerical_column(column)
  else:
    print('...categorical column')
    array = vectorize_categorical_column(column)
  print('...shape:', array.shape)
  print('-')
  return array

  
def vectorize_numerical_column(column):
  return np.expand_dims(column.values, -1).astype('float32')


def vectorize_categorical_column(column):
  print('...value counts:')
  print(column.value_counts())
  
  # Keep only the values that occur at least
  # SKIP_VALUES_WITH_FEWER_OCCURENCES_THAN times, and assign an integer
  # index to each of them (for one-hot encoding).
  value_index = {}
  for value, count in column.value_counts().items():
    if count >= SKIP_VALUES_WITH_FEWER_OCCURENCES_THAN:
      value_index[value] = len(value_index)
  # Note that our `value_index` omits rare values and missing entries,
  # so samples with such values are encoded as all zeros.
  print('...kept values:', list(value_index.keys()))
  
  # One-hot encode the filtered values.
  array = np.zeros(shape=(len(column), len(value_index)), dtype='float32')
  for i, value in enumerate(column):
    if value in value_index:
      index = value_index[value]
      array[i, index] = 1.
  return array

In [0]:
column_arrays = []
# We exclude columns 0 and 1 (not informative) and the last column (targets)
for name in dataframe.columns[2:-1]:
  array = vectorize_column(dataframe[name])
  column_arrays.append(array)

features = np.concatenate(column_arrays, axis=-1)
print('features shape: %s' % (features.shape,))

Processing column: "MS SubClass"
...numerical column
...shape: (2930, 1)
-
Processing column: "MS Zoning"
...categorical column
...value counts:
RL         2273
RM          462
FV          139
RH           27
C (all)      25
I (all)       2
A (agr)       2
Name: MS Zoning, dtype: int64
...kept values: ['RL', 'RM', 'FV']
...shape: (2930, 3)
-
Processing column: "Lot Frontage"
...numerical column
...shape: (2930, 1)
-
Processing column: "Lot Area"
...numerical column
...shape: (2930, 1)
-
Processing column: "Street"
...categorical column
...value counts:
Pave    2918
Grvl      12
Name: Street, dtype: int64
...kept values: ['Pave']
...shape: (2930, 1)
-
Processing column: "Alley"
...categorical column
...value counts:
Grvl    120
Pave     78
Name: Alley, dtype: int64
...kept values: ['Grvl', 'Pave']
...shape: (2930, 2)
-
Processing column: "Lot Shape"
...categorical column
...value counts:
Reg    1859
IR1     979
IR2      76
IR3      16
Name: Lot Shape, dtype: int64
...kept values: ['Reg'

## Split data into a training set and a validation set

In [0]:
num_val_samples = int(len(features) * 0.2)

train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]

val_features = features[-num_val_samples:]
val_targets = targets[-num_val_samples:]

In [0]:
print('Number of training samples:', len(train_features))
print('Number of validation samples:', len(val_features))

Number of training samples: 2344
Number of validation samples: 586
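
The slicing used for the split can be checked on placeholder arrays. The sketch below assumes only the row count reported above (2930); the `toy_*` arrays are hypothetical stand-ins for the real features and targets.

```python
import numpy as np

# Toy stand-ins with the same number of rows as the dataset (2930).
toy_features = np.arange(2930 * 2, dtype='float32').reshape(2930, 2)
toy_targets = np.arange(2930, dtype='float32')

num_val = int(len(toy_features) * 0.2)

# The last `num_val` rows are held out for validation; nothing is shuffled,
# so the validation set is simply the tail of the file.
toy_train_x, toy_val_x = toy_features[:-num_val], toy_features[-num_val:]
toy_train_y, toy_val_y = toy_targets[:-num_val], toy_targets[-num_val:]

print(len(toy_train_x), len(toy_val_x))
```

Because no shuffling takes place, any ordering present in the CSV file carries over into the split.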


## Replace missing values with mean of column

Some numerical columns apparently contain NaN values (corresponding to missing entries):

In [0]:
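nan_indices = np.where(np.isnan(train_features))
# Caveat: np.unique flattens the row and column index arrays, so this counts
# distinct index values; len(nan_indices[0]) gives the number of NaN entries.
print('Number of NaN entries in train_features:', len(np.unique(nan_indices)))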

Number of NaN entries in train_features: 547


We replace them with the mean of their column, using the NumPy utility `np.nanmean` to compute the column means while excluding NaN values (otherwise the mean itself would be NaN). The means are computed on the training set only and reused to fill the validation set.

In [0]:
features_mean = np.nanmean(train_features, axis=0)

nan_indices = np.where(np.isnan(train_features))
train_features[nan_indices] = np.take(features_mean, nan_indices[1])

nan_indices = np.where(np.isnan(val_features))
val_features[nan_indices] = np.take(features_mean, nan_indices[1])
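
To see what the imputation does, here is a minimal sketch on toy arrays (the `toy_*` names and values are made up for illustration). It follows the same idea as the cell above: column means come from the training data, and the column index of each NaN selects the mean used to fill it.

```python
import numpy as np

toy_train = np.array([[1., np.nan],
                      [3., 4.],
                      [np.nan, 8.]], dtype='float32')
toy_val = np.array([[np.nan, 10.]], dtype='float32')

# Total number of NaN entries in the training array.
num_nan = int(np.isnan(toy_train).sum())

# Column means over the non-NaN training entries.
toy_mean = np.nanmean(toy_train, axis=0)

# np.where returns (row_indices, column_indices) of the NaN entries.
rows, cols = np.where(np.isnan(toy_train))
toy_train[rows, cols] = toy_mean[cols]

# The validation set is filled with the *training* means.
rows, cols = np.where(np.isnan(toy_val))
toy_val[rows, cols] = toy_mean[cols]

print(num_nan)
print(toy_train)
print(toy_val)
```

Filling with the column mean leaves the training mean of each column unchanged, which is why the same `features_mean` can be reused for centering in the next step.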

## Normalize feature values

Some numerical features take large integer values, so we should normalize them to a range more amenable to a neural network.

In [0]:
train_features -= features_mean
val_features -= features_mean

features_std = np.std(train_features, axis=0)
train_features /= features_std
val_features /= features_std
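
The same standardization can be sketched on toy data. The arrays below are made up, and the guard against a zero standard deviation is an added assumption (the cell above does not include it); it only matters if some column is constant over the training set.

```python
import numpy as np

toy_train = np.array([[1000., 0.],
                      [3000., 1.],
                      [5000., 1.]], dtype='float32')
toy_val = np.array([[4000., 0.]], dtype='float32')

mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)
# Assumption, not in the original cell: avoid dividing by zero for
# columns that are constant over the training set.
std = np.where(std == 0, 1., std)

toy_train_norm = (toy_train - mean) / std
# The validation data reuses the training statistics.
toy_val_norm = (toy_val - mean) / std

print(toy_train_norm.mean(axis=0))
print(toy_train_norm.std(axis=0))
```

After this step each training column has mean close to 0 and standard deviation close to 1. The validation columns generally do not, since they are scaled with statistics from the training set.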

## Normalize target values

Since our targets also take very large integer values (dollar prices for houses), we should normalize them to a range more amenable to a neural network.

Here, we choose median normalization (divide by the median, without subtracting the mean first): a price encoded as "1.5" means "50% higher than the median price", a price encoded as "0.5" means "50% of the median price", etc. This has the advantage of being easily interpretable by humans.

In [0]:
targets_median = np.median(train_targets, axis=0)
train_targets /= targets_median
val_targets /= targets_median
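
As a small worked example of this encoding, the sketch below median-normalizes five made-up prices and then maps them back to dollars. The `toy_*` values are hypothetical.

```python
import numpy as np

toy_prices = np.array([100000., 150000., 200000., 300000., 450000.],
                      dtype='float32')
toy_median = np.median(toy_prices)

# Encode: 1.0 is the median price, 1.5 is 50% above it, 0.5 is half of it.
encoded = toy_prices / toy_median
# Decode: multiply by the same median to get dollar prices back.
decoded = encoded * toy_median

print(encoded)
print(decoded)
```

The median itself encodes to exactly 1.0, and decoding only requires keeping the median computed on the training targets.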

## Train a wide-and-deep network

A wide-and-deep network consists of a deep network (a stack of `Dense` layers) and a wide network (a single `Dense` layer) trained end to end. The final output is the sum of the predictions of each network. Often, the wide part and the deep part take different features as inputs; here, for simplicity, we feed all features to both networks.

Since this is a regression problem, we don't apply any activation in our final layers, so as not to artificially constrain the range taken by the predictions.

In [0]:
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=features.shape[1:])
x = layers.Dense(128, activation='relu')(inputs)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dropout(0.5)(x)
deep_output = layers.Dense(1)(x)

wide_output = layers.Dense(1)(inputs)
total_output = layers.add([deep_output, wide_output])

model = keras.Model(inputs, total_output, name='wide_and_deep')
model.summary()

Model: "wide_and_deep"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 183)]        0                                            
__________________________________________________________________________________________________
dense (Dense)                   (None, 128)          23552       input_1[0][0]                    
__________________________________________________________________________________________________
dropout (Dropout)               (None, 128)          0           dense[0][0]                      
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 128)          16512       dropout[0][0]                    
______________________________________________________________________________________
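
The parameter counts in the summary can be checked by hand, and the topology can be mimicked in plain NumPy. This is only a sketch with random weights and hypothetical helper names (`dense_param_count`, `dense`). It is not the Keras implementation, and it omits dropout, which is inactive at inference time.

```python
import numpy as np


def dense_param_count(num_inputs, num_units):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_units + num_units


print(dense_param_count(183, 128))  # 23552, first Dense layer in the summary
print(dense_param_count(128, 128))  # 16512, second Dense layer in the summary

rng = np.random.default_rng(0)


def dense(x, num_units, activation=None):
    """Randomly initialized fully connected layer (illustration only)."""
    w = rng.normal(scale=0.1, size=(x.shape[-1], num_units))
    b = np.zeros(num_units)
    y = x @ w + b
    return np.maximum(y, 0.) if activation == 'relu' else y


x = rng.normal(size=(4, 183))   # a batch of 4 hypothetical samples
h = dense(x, 128, 'relu')
h = dense(h, 128, 'relu')
h = dense(h, 128, 'relu')
deep_out = dense(h, 1)          # no activation: unconstrained output range
wide_out = dense(x, 1)          # a linear model on the raw inputs
total_out = deep_out + wide_out # the two predictions are summed
print(total_out.shape)
```

The wide branch is a linear regression on the inputs, and the deep branch adds a nonlinear term to it.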

Since this is a regression problem, we use `MeanSquaredError` (`mse`) as our loss. We also monitor `mean_absolute_percentage_error`, since this is a human-interpretable way to track performance on a price regression problem.

In [0]:
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mean_absolute_percentage_error'])

callbacks = [keras.callbacks.ModelCheckpoint(
    'ames_regression_model_at_epoch_{epoch}.h5')]

model.fit(train_features, train_targets,
          batch_size=512,
          epochs=500,
          verbose=2,
          callbacks=callbacks,
          validation_data=(val_features, val_targets))

Train on 2344 samples, validate on 586 samples
Epoch 1/500
2344/2344 - 1s - loss: 4.7667 - mean_absolute_percentage_error: 180.2890 - val_loss: 2.0175 - val_mean_absolute_percentage_error: 125.3691
Epoch 2/500
2344/2344 - 0s - loss: 2.9648 - mean_absolute_percentage_error: 143.3357 - val_loss: 1.4821 - val_mean_absolute_percentage_error: 105.5913
Epoch 3/500
2344/2344 - 0s - loss: 2.4766 - mean_absolute_percentage_error: 127.7989 - val_loss: 1.3212 - val_mean_absolute_percentage_error: 99.1847
Epoch 4/500
2344/2344 - 0s - loss: 2.0363 - mean_absolute_percentage_error: 117.5641 - val_loss: 1.2346 - val_mean_absolute_percentage_error: 95.5825
Epoch 5/500
2344/2344 - 0s - loss: 1.7519 - mean_absolute_percentage_error: 112.0753 - val_loss: 1.1310 - val_mean_absolute_percentage_error: 91.4418
Epoch 6/500
2344/2344 - 0s - loss: 1.6732 - mean_absolute_percentage_error: 106.6928 - val_loss: 1.0821 - val_mean_absolute_percentage_error: 88.8964
Epoch 7/500
2344/2344 - 0s - loss: 1.4949 - mean_ab

<tensorflow.python.keras.callbacks.History at 0x7fcb065bda20>

After 500 epochs, our predictions are off by less than 10% on average. This is pretty good.

## Decoding predictions

Because our targets were median-normalized, the model predicts prices in units of the median price. We need to multiply the predictions by the median value to get dollar prices.

In [0]:
preds = model.predict(val_features[:10])
preds *= targets_median
preds = [int(p) for p in preds.flatten()]  # preds has shape (10, 1)
print('Price predictions for first 10 samples:', preds)
print('Actual prices:', [int(p) for p in val_targets[:10] * targets_median])

Price predictions for first 10 samples: [221236, 166130, 163822, 214029, 171229, 200792, 248598, 160878, 172839, 258069]
Actual prices: [235000, 190000, 169000, 241500, 188899, 207500, 250000, 179900, 172000, 294323]
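
As a quick sanity check, the mean absolute percentage error can be recomputed by hand for the ten samples printed above. The function below is a plain NumPy sketch of that metric, not the Keras implementation, and the numbers are copied from the output.

```python
import numpy as np

# Values copied from the printed output above.
preds = np.array([221236, 166130, 163822, 214029, 171229,
                  200792, 248598, 160878, 172839, 258069], dtype='float64')
actual = np.array([235000, 190000, 169000, 241500, 188899,
                   207500, 250000, 179900, 172000, 294323], dtype='float64')


def mean_absolute_percentage_error(y_true, y_pred):
    """Average of |error| / |true value|, expressed in percent."""
    return 100. * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))


mape = mean_absolute_percentage_error(actual, preds)
print('MAPE on these 10 samples: %.2f%%' % mape)
```

Ten samples are too few to summarize the whole validation set, but the result should be of the same order as the metric reported during training.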
