# Regression With TensorFlow (House Prices)
Let's predict some house prices.

In [None]:
from __future__ import absolute_import, division, print_function

%matplotlib inline

import pathlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import datetime

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras

## Step 1 : Read Data

In [None]:
data_location = '../data/house-prices/house-sales-full.csv'
# data_location = 'https://elephantscale-public.s3.amazonaws.com/data/house-prices/house-sales-full.csv'

house_prices = pd.read_csv(data_location)
house_prices

## Step 2 : Cleanup Data

In [None]:
print("original row count : ", house_prices.shape)
house_prices = house_prices.dropna()
print ("cleaned up row count : ", house_prices.shape)

## Step 3 : Exploratory Data Analysis (EDA)
EDA gives us a sense of the data.  It is highly recommended that you do this before training a model.

**==> Q : What is the maximum number of bedrooms? :-)**

In [None]:
## get a summary of data
pd.options.display.float_format = '{:,.2f}'.format

## TODO : use 'describe()' function to get summary info
house_prices.???().T

## Step 4: Remove Outliers
As you can see, we have a few outliers.  
Let's remove them by keeping only houses with 5 or fewer bedrooms.

In [None]:
## TODO : commented out for now, 
##        uncomment during tuning phase

# house_prices = house_prices[house_prices['Bedrooms'] <= 5]
# house_prices
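
To see why a single extreme row matters, here is a minimal sketch in plain Python. The bedroom counts are made up for illustration (they are not taken from the dataset); the value 33 stands in for an outlier, and the `<= 5` cutoff mirrors the filter in the cell above.

```python
import statistics

# Hypothetical bedroom counts; 33 plays the role of the outlier
bedrooms = [2, 3, 3, 4, 3, 2, 4, 3, 33]

# Same rule as the (commented-out) filter above: keep houses with <= 5 bedrooms
filtered = [b for b in bedrooms if b <= 5]

mean_all = statistics.mean(bedrooms)
mean_filtered = statistics.mean(filtered)
std_all = statistics.stdev(bedrooms)
std_filtered = statistics.stdev(filtered)

print("with outlier    : mean = {:.2f}, std = {:.2f}".format(mean_all, std_all))
print("without outlier : mean = {:.2f}, std = {:.2f}".format(mean_filtered, std_filtered))
```

One row out of nine is enough to roughly double the mean and inflate the standard deviation, which in turn distorts any scaling based on those statistics.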

## Step 5 : Choose Columns to consider
Which attributes do you think are important in deciding SalePrice?

In [None]:
## TODO : Experiment with this, 
## select columns you think are important in determining SalePrice
## Hint : Start with : 'Bedrooms', 'Bathrooms', 'SqFtTotLiving', 'SqFtLot'
input_columns = ['???', '???', '???', '???']
label_column = 'SalePrice'
# x = house_prices.loc[:, input_columns]
x = house_prices [input_columns]
y = house_prices[[label_column]]

print(x.head())
print ('--------')
print (y.head())

## Step 6 : Split data into train / test

In [None]:
from sklearn.model_selection import train_test_split

## TODO split train/test = 80% / 20%
## Hint : test_size=0.2  (representing 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = ???, random_state = 0)

x_train_orig = x_train
x_test_orig = x_test

print ("x_train.shape : ", x_train.shape)
print ("y_train.shape : ", y_train.shape)
print ("x_test.shape : ", x_test.shape)
print ("y_test.shape : ", y_test.shape)

## Step 7 : Scale Data

In [None]:
## To turn off scaling, comment this cell out

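def my_scaler(df, ref):
    ## Scale df using the statistics of ref (the training data),
    ## so that train and test are transformed in exactly the same way
    #return (df-ref.min())/(ref.max()-ref.min())  ## this is min/max scaler
    return (df - ref.mean()) / ref.std()

print ("x_train: before and after")
print(x_train_orig.head())
x_train = my_scaler(x_train_orig, x_train_orig)
print(x_train.head())

print ('-----')
print ('x_test: before / after')
print (x_test_orig.head())
x_test = my_scaler (x_test_orig, x_train_orig)  ## reuse the training mean / std
print (x_test.head())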
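
A common convention is to compute the scaling statistics on the training set only and reuse them for the test set, so that the model sees test data on the same scale it was trained on. The sketch below shows that arithmetic in plain Python; the square-footage values and the helper name `standardize` are made up for illustration.

```python
import statistics

# Hypothetical living-area values (sq ft), made up for illustration
train_sqft = [1000, 1500, 2000, 2500, 3000]
test_sqft = [2000, 3500]

# The statistics come from the training data only
train_mean = statistics.mean(train_sqft)
train_std = statistics.stdev(train_sqft)  # sample std, like pandas' default .std()

def standardize(values, mean, std):
    """Return (value - mean) / std for every value."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train_sqft, train_mean, train_std)
test_scaled = standardize(test_sqft, train_mean, train_std)

print("train scaled :", [round(v, 2) for v in train_scaled])
print("test scaled  :", [round(v, 2) for v in test_scaled])
```

After scaling, the training values have mean 0 and standard deviation 1; the test values generally will not, and that is expected.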

## Step 8:  Build a Model

Build a 3-layer network:
- input (64 neurons)
- hidden (64 neurons)
- output (1 neuron)

In [None]:
def build_model():
    input_dim = len(x_train.keys())
    print ("input_dim : ", input_dim)
    ## TODO : define a model
    ##   - add 64 neurons (units=64) for 'input_layer'  and 'hidden_1' layer
    ##   - final output layer has ONE neuron  (units=1)
    model = tf.keras.Sequential([
                tf.keras.layers.Dense(units=???, activation=tf.nn.relu, input_shape=[input_dim], name="input_layer"),
                tf.keras.layers.Dense(units=???, activation=tf.nn.relu, name="hidden_1"),
                tf.keras.layers.Dense(units=???, name="output_layer")
            ])

    ## TODO : We start with RMSProp.  Feel free to try other optimizers
    optimizer = tf.keras.optimizers.RMSprop(0.01)

    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])

    return model



In [None]:
model = build_model()
model.summary()  ## summary() prints the table itself and returns None
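
The parameter counts in the model summary can be checked by hand: a fully connected (Dense) layer has one weight per input per unit, plus one bias per unit. The sketch below assumes `input_dim = 4` (the four columns hinted at in Step 5) and the 64 / 64 / 1 layout described above; `dense_params` is a helper name made up for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return n_inputs * n_units + n_units

input_dim = 4              # assumption: four input columns
layer_units = [64, 64, 1]  # input_layer, hidden_1, output_layer

counts = []
n_inputs = input_dim
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units       # this layer's outputs feed the next layer

total = sum(counts)
print("params per layer :", counts)  # [320, 4160, 65]
print("total params     :", total)   # 4545
```

If you choose a different number of input columns, only the first layer's count changes.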

## Step 9: Setup Tensorboard

In [None]:
## This is fairly boilerplate code

import datetime
import os

app_name = 'regression-house-prices' # you can change this, if you like

tb_top_level_dir= '/tmp/tensorboard-logs'
tensorboard_logs_dir= os.path.join (tb_top_level_dir, app_name, 
                                    datetime.datetime.now().strftime("%Y-%m-%d--%H-%M-%S"))
print ("Saving TB logs to : " , tensorboard_logs_dir)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=tensorboard_logs_dir, histogram_freq=1)

# patience = number of epochs to wait for an improvement in val_loss before stopping
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
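
To build intuition for `patience`, here is a simplified sketch of the idea in plain Python. It is not Keras's actual implementation (it ignores options such as `min_delta`), and the function name `stopping_epoch` and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0                      # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None                   # ran through all epochs without stopping

# Hypothetical validation losses, one per epoch
val_losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.9]

print(stopping_epoch(val_losses, patience=3))  # 5
print(stopping_epoch(val_losses, patience=5))  # None
```

With `patience=3` the run stops just before the improvement at the last epoch; a larger patience would have found it, at the cost of more training time.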

## Step 10:  Train

In [None]:
%%time

## TODO start with 100, try 500 and 1000 
epochs = 100  ## experiment 100, 500, 1000

print ("training starting ...")
## TODO : to see training output set verbose=2
history = model.fit(
              x_train, y_train,
              epochs=epochs, validation_split = 0.2, verbose=0,
              callbacks=[early_stop, tensorboard_callback])

print ("training done.")

##TODO : how long is the training taking?

## Step 11: History

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(history.history['mean_squared_error'], label='mse')
plt.plot(history.history['mean_absolute_error'], label='mae')
plt.legend()
plt.show()

## Step 12 : Evaluate Model

In [None]:
metric_names = model.metrics_names
print ("model metrics : " , metric_names)
metrics = model.evaluate(x_test, y_test, verbose=0)

for name, value in zip(metric_names, metrics):
    print ("Metric : {} = {:,.2f}".format (name, value))
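
The two metrics are easier to interpret once you compute them by hand. Mean absolute error is in the same units as `SalePrice`, while mean squared error is in squared units, so its square root (RMSE) is often reported instead. The prices below are made up for illustration.

```python
import math

# Hypothetical actual and predicted sale prices
actual = [300000, 450000, 500000]
predicted = [310000, 430000, 500000]

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                             # back in price units

print("MAE  = {:,.2f}".format(mae))
print("MSE  = {:,.2f}".format(mse))
print("RMSE = {:,.2f}".format(rmse))
```

Because errors are squared before averaging, MSE and RMSE penalize a few large misses more heavily than MAE does.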

## Step 13: Predict

In [None]:
predictions = model.predict(x_test)
print (x_test)
print(predictions)

## Step 14: Evaluate prediction output
Let's build a pandas DataFrame of the predictions and examine it

In [None]:
predictions_df = x_test_orig.copy()  # use the original one, not scaled (copy leaves x_test_orig untouched)
predictions_df['actual_price'] = y_test[label_column]
predictions_df['predicted_price'] = predictions.flatten()  # (n, 1) array -> 1-D
predictions_df['error'] = predictions_df['actual_price'] - predictions_df['predicted_price'] 

pd.options.display.float_format = '{:,.2f}'.format
## print sample to see different data every time
predictions_df.sample(frac=0.1)
## or just print the first few
# predictions_df

In [None]:
## which house did we get really wrong?
print ("biggest error : ")
predictions_df.loc[predictions_df['error'].abs().idxmax()]


In [None]:
## which house are we spot on?
print ("lowest error")
predictions_df.loc[predictions_df['error'].abs().idxmin()]

### How many house sales did we predict within 5%?
Let's use a 5% margin of error as our benchmark

In [None]:
predictions_df['error_percentage'] = predictions_df['error'].abs() * 100 / predictions_df['actual_price']
predictions_df

In [None]:
## TODO : you can adjust the benchmark target
benchmark = 5  # 5%

good_predictions = predictions_df[predictions_df['error_percentage'] <= benchmark]

good_predictions

In [None]:
meeting_benchmark = good_predictions.shape[0] *100 / predictions_df.shape[0]

print ("number of predictions within benchmark error ({}%)  =  {:,}  ({:.1f}% of total)".
       format (benchmark, good_predictions.shape[0], meeting_benchmark))
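
The benchmark calculation above can be traced on a handful of numbers. This is a minimal sketch in plain Python with made-up prices; it follows the same formula as the `error_percentage` column.

```python
# Hypothetical actual and predicted sale prices
actual = [300000, 450000, 500000, 200000]
predicted = [310000, 430000, 500000, 260000]
benchmark = 5  # 5%

# Absolute error as a percentage of the actual price
error_pct = [abs(a - p) * 100 / a for a, p in zip(actual, predicted)]

# True where the prediction is within the benchmark
within = [e <= benchmark for e in error_pct]
meeting_benchmark = sum(within) * 100 / len(actual)

print("error % :", [round(e, 2) for e in error_pct])
print("within {}% : {} of {}  ({:.1f}% of total)".format(
    benchmark, sum(within), len(actual), meeting_benchmark))
```

Three of the four hypothetical predictions fall within 5%, so 75% meet the benchmark; the fourth is off by 30%.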


## Step 15: Ideas to Try
Now that we have done an 'end-to-end' regression implementation, let's tune our algorithm.  

**==> Q : What are some of the things we can do to get better performance?**

Here are some ideas to get you started
- **Idea 1 : Any other inputs we can add?**  
  - In Step 5, add a couple more columns as input
  - only choose numeric columns at this time
  - Try adding 'LandVal'  as an input column.  Run again, did that improve the benchmark performance?
  - What would be the implication of adding all the columns?
  
- **Idea 2 : Remove outliers**  
As you noticed, we have quite a few outliers (remember the 33-bedroom house? :-).  Outliers tend to skew the results, so let's remove them.
  - Step 4 : uncomment the cell.  Here we keep only houses that have 5 or fewer bedrooms
  
- **Idea 3 : Increase epochs**  
  - In Step 10, increase epochs from 100 to 500 to 1000
  - Notice the training time will increase
  - do you get better results?  why or why not?
  
- **Idea 4 : Build a bigger network** 
  - In Step 8, we are setting up our network.  We are using 64 neurons per layer
  - Increase the number of neurons from 64 to 128
  - Does the training time go up?
  - Are you getting better accuracy?
  - We can also add more layers and build a 'deeper' network.  More on this later
  
- **Idea 5 : Need more data :-)**  
Most of the time, neural networks can yield better results if trained on more data

- **Any other ideas?**

#### Share your experiments with the class!

**What is the best score you have gotten? :-)**

## Final Step : Create the most compact code
In this notebook we walked you through multiple steps for learning purposes.  
Now we are asking you to come up with **bare minimum** code to implement this neural net.  

### Class Challenge :-)
- Let's see who can come up with the most compact code (fewest lines)  
- Create a new notebook, and start from scratch
- A few hints
  - no prints
  - minimize comments
  - no debug / exploration
  
**Ready, set, go!**