## Table of Contents
- Overview
- Import Packages
- Import Datasets
- Exploratory Data Analysis
- Data Preprocessing
- Model Development
- Model Evaluation
- Conclusion

## Overview
In this notebook I use the House Sales in King County, USA dataset to build a house price predictor. I first import the packages and the dataset, then carry out Exploratory Data Analysis and preprocess the data based on it. After that I build a wide and deep model using TensorFlow Feature Columns and DenseFeatures, train it, and finally evaluate it.

## Import Packages

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf 
from tensorflow import feature_column

## Import Datasets

In [None]:
data = pd.read_csv("/kaggle/input/housesalesprediction/kc_house_data.csv")

## Exploratory Data Analysis

First, show the first 5 rows:

In [None]:
data.head()

Show summary statistics:

In [None]:
data.describe().transpose()

**Compute correlation scores**

Let's compute the pairwise correlation of the columns and see which features are most correlated with price.

In [None]:
# Restrict to numeric columns; the date column is stored as strings.
data.select_dtypes("number").corr()

In [None]:
data.select_dtypes("number").corr()["price"].sort_values(ascending=False)
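
As a rough illustration of what these correlation scores measure, the sketch below computes the Pearson correlation coefficient by hand and ranks columns by it. The table values are invented for illustration and are not taken from the dataset; `pearson` is a helper name introduced here.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float((x_centered * y_centered).sum()
                 / np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum()))

# Toy values, invented for illustration only.
toy = {
    "price": [200_000, 350_000, 500_000, 650_000],
    "sqft_living": [900, 1500, 2100, 2600],
    "sqft_lot": [6000, 4000, 7000, 5000],
}

# Correlate every column with price, then sort from most to least correlated.
scores = {name: pearson(values, toy["price"]) for name, values in toy.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

A score near 1 means the column rises almost linearly with price, and a score near 0 means there is no linear relationship, which says nothing about non-linear ones.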

Let's look at the types of the different features. The date column has object type, and most of the other features are numerical. However, we need to note the following:
- date carries year, month and day information. Combined with yr_built and yr_renovated, it lets us calculate how old the house is and how many years have passed since it was renovated.
- id and zipcode are not correlated with price, so we won't use them to predict house prices.
- view, waterfront, condition and grade look like quantities, but they are better treated as categories.
- lat and long are quantities, and combining them gives location information.


In [None]:
data.info()

## Data Preprocessing
We need to preprocess the dataset in the following ways:
- Extract year information from date column
- Calculate how long it has been since the houses were built and renovated
- Remove unnecessary columns
- Create TensorFlow Feature Columns for Modeling
- Train test split

**Extract year information from date column**

In [None]:
data["year"] = data["date"].apply(lambda date: int(date[0:4]))
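
The slice above relies on the year being the first four characters of the date string. As a cross-check, here is a small sketch that parses the whole timestamp instead, assuming the strings look like `20141013T000000` (an assumption about the file format; `sale_year` is a helper name introduced here).

```python
from datetime import datetime

def sale_year(date_string):
    """Parse a timestamp such as '20141013T000000' and return its year."""
    return datetime.strptime(date_string, "%Y%m%dT%H%M%S").year

sample = "20141013T000000"  # assumed format, for illustration only
# Both approaches agree when the format assumption holds.
assert sale_year(sample) == int(sample[0:4])
print(sale_year(sample))
```

Parsing is stricter than slicing: a malformed date raises a `ValueError` instead of silently producing a wrong year.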

**Calculate how long it has been since houses were built and renovated**

In [None]:
data["years_since_built"] = data["year"] - data["yr_built"]
data["years_since_renovated"] = data["year"] - data["yr_renovated"]
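
One thing worth checking here: if yr_renovated uses 0 as a placeholder for houses that were never renovated (an assumption about this dataset, not something verified above), then subtracting it gives a value close to the sale year for those rows. A possible alternative, sketched on made-up rows, falls back to the construction year instead:

```python
import numpy as np

# Made-up rows for illustration; 0 in yr_renovated is assumed to mean "never renovated".
year = np.array([2014, 2015, 2014])
yr_built = np.array([1955, 1987, 2001])
yr_renovated = np.array([0, 2005, 0])

# Use the renovation year where one exists, otherwise the construction year.
last_work_year = np.where(yr_renovated > 0, yr_renovated, yr_built)
years_since_renovated = year - last_work_year
print(years_since_renovated.tolist())  # [59, 10, 13]
```

This is not what the cell above does. Switching to it changes the feature, so the scores reported later in this notebook would no longer apply as-is.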

**Remove unnecessary columns**

In [None]:
unnecessary_column_names = ["id", "date", "zipcode", "yr_built", "yr_renovated", "year"]
for column_name in unnecessary_column_names:
    data.pop(column_name)

In [None]:
data.describe().transpose()

Let's calculate the correlation scores with price again:

In [None]:
data.corr()["price"].sort_values(ascending=False)

### Create TensorFlow Feature Columns

Create numerical columns:

In [None]:
numerical_colunmn_names = [
    'bedrooms',
    'bathrooms',
    'sqft_living',
    'sqft_lot',
    'floors',
    'sqft_above',
    "sqft_basement",
    "sqft_living15",
    "sqft_lot15",
    "years_since_built",
    "years_since_renovated",
    "long",
    "lat"
]
numerical_colunmns = [feature_column.numeric_column(name, dtype=float) for name in numerical_colunmn_names]

In [None]:
for column in numerical_colunmn_names:
    data[column] = data[column].astype(float)

Create categorical columns:

In [None]:
categorical_column_names = ["waterfront", "condition", "grade", "view"]
categorical_column_lists = [sorted(data[item].unique()) for item in categorical_column_names]
categorical_columns = [feature_column.indicator_column(feature_column.categorical_column_with_vocabulary_list(name, category)) for (name,category) in zip(categorical_column_names, categorical_column_lists)]
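
To make the idea behind an indicator column over a vocabulary list more concrete, here is a plain NumPy sketch of one-hot encoding. It only illustrates the concept and does not reproduce TensorFlow's implementation; `one_hot` and the sample values are introduced here for illustration.

```python
import numpy as np

def one_hot(values, vocabulary):
    """Return a (len(values), len(vocabulary)) matrix with at most one 1 per row."""
    index = {category: position for position, category in enumerate(vocabulary)}
    encoded = np.zeros((len(values), len(vocabulary)), dtype=np.float32)
    for row, value in enumerate(values):
        # In this sketch, values outside the vocabulary keep an all-zero row.
        if value in index:
            encoded[row, index[value]] = 1.0
    return encoded

condition_vocabulary = [1, 2, 3, 4, 5]   # hypothetical sorted unique values
condition_values = [3, 5, 1]             # made-up sample rows
print(one_hot(condition_values, condition_vocabulary))
```

Each category gets its own column, so the model does not have to assume that, say, grade 8 is exactly twice grade 4.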

Create a crossed column for location by combining bucketized latitude and longitude:

In [None]:
min_lat, max_lat = data["lat"].min(), data["lat"].max()
min_long, max_long = data["long"].min(), data["long"].max()
print(min_lat, max_lat, min_long, max_long)
num_buckets = 8
latbuckets = np.linspace(start=min_lat, stop=max_lat, num=num_buckets).tolist()
lonbuckets = np.linspace(start=min_long, stop=max_long, num=num_buckets).tolist()
print(latbuckets, lonbuckets)
lat_column = feature_column.bucketized_column(
    source_column=feature_column.numeric_column("lat"), boundaries=latbuckets)
long_column = feature_column.bucketized_column(
    source_column=feature_column.numeric_column("long"), boundaries=lonbuckets)
location_column = feature_column.crossed_column(
    [lat_column, long_column], 
    hash_bucket_size=num_buckets * num_buckets
)
location_embedding_column = feature_column.embedding_column(categorical_column=location_column, dimension=3)
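
The bucketing and crossing above can be sketched in plain NumPy to see how coordinates turn into grid cells. The coordinate ranges below are hypothetical, and `cell_id` combines the two bucket indices directly, whereas a crossed column hashes the combination, so this is a simplification rather than the same computation.

```python
import numpy as np

num_buckets = 8
# Hypothetical coordinate ranges, for illustration only.
lat_boundaries = np.linspace(47.0, 48.0, num=num_buckets)
long_boundaries = np.linspace(-122.5, -121.5, num=num_buckets)

def bucket_index(value, boundaries):
    """Index of the bucket holding value; n boundaries give n + 1 buckets."""
    return int(np.digitize(value, boundaries))

def cell_id(lat, long):
    """Combine both bucket indices into a single grid-cell id (no hashing)."""
    buckets_per_axis = len(long_boundaries) + 1
    return (bucket_index(lat, lat_boundaries) * buckets_per_axis
            + bucket_index(long, long_boundaries))

print(bucket_index(46.9, lat_boundaries))  # 0: below the first boundary
print(bucket_index(47.0, lat_boundaries))  # 1: the minimum sits on the first boundary
print(bucket_index(48.0, lat_boundaries))  # 8: the maximum sits on the last boundary
print(cell_id(47.5, -122.0))
```

With 8 boundaries per axis there are 9 buckets per axis and 81 possible cells. A crossed column hashed into 64 buckets can therefore map distinct cells to the same id, which is acceptable for a coarse location feature but worth keeping in mind.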

In [None]:
wide_columns = [
    feature_column.indicator_column(location_column)
] + categorical_columns

deep_columns = [location_embedding_column] + numerical_colunmns

In [None]:
inputs = dict()
for item in numerical_colunmns:
    inputs[item.key] = tf.keras.layers.Input(name=item.key, shape=(), dtype="float32")
for item in categorical_columns:
    inputs[item.categorical_column.key] = tf.keras.layers.Input(name=item.categorical_column.key, shape=(), dtype="int32")

In [None]:
inputs

**Train test split**

In [None]:
from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data, test_size=0.2, random_state=997)
data_train.to_csv("data_train.csv",index=False)
data_test.to_csv("data_test.csv",index=False)

**Create TensorFlow Dataset**

In [None]:
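def features_and_labels(row_data):
    # Split a batch of columns into model features and the price label.
    label = row_data.pop("price")
    features = row_data
    return features, label

def create_dataset(pattern, epochs=1, batch_size=32, mode='eval'):
    # By default make_csv_dataset shuffles rows and repeats the data indefinitely.
    dataset = tf.data.experimental.make_csv_dataset(
        pattern, batch_size
    )
    dataset = dataset.map(features_and_labels)
    if mode == 'train':
        # Shuffles whole batches; the source already repeats, so repeat(epochs) adds nothing.
        dataset = dataset.shuffle(buffer_size=1000).repeat(epochs)
    dataset = dataset.prefetch(1)
    return dataset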

In [None]:
batch_size = 100
train_data = create_dataset("data_train.csv", batch_size=batch_size, mode='train')
test_data = create_dataset("data_test.csv", batch_size=batch_size, mode='eval').take(data_test.shape[0] // batch_size)
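
The floor division used for the number of test batches drops the last partial batch, and the same pattern is used later for the number of training steps. Here is the arithmetic worked through, assuming the file has 21,613 rows (an assumed figure for illustration, not printed anywhere above):

```python
import math

total_rows = 21_613      # assumed row count, for illustration only
test_size = 0.2
batch_size = 100

# scikit-learn rounds the test share up when test_size is a fraction.
test_rows = math.ceil(total_rows * test_size)
train_rows = total_rows - test_rows

test_batches = test_rows // batch_size       # full batches kept by take()
steps_per_epoch = train_rows // batch_size   # same floor division for training
leftover_test_rows = test_rows - test_batches * batch_size

print(test_rows, train_rows)           # 4323 17290
print(test_batches, steps_per_epoch)   # 43 172
print(leftover_test_rows)              # 23
```

Under that assumption each validation pass sees 43 batches, so a handful of rows fall outside the full batches on every pass.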

## Model Development
Create a wide and deep model using two DenseFeatures layers: a deep branch that fits the numerical data and the location embedding, and a wide branch that fits the sparse, categorical data.

In [None]:
def build_model():
    deep = tf.keras.layers.DenseFeatures(deep_columns, name='deep_inputs')(inputs)
    deep = tf.keras.layers.Dense(32, activation='relu')(deep)
    deep = tf.keras.layers.Dense(32, activation='relu')(deep)
    deep = tf.keras.layers.Dense(32, activation='relu')(deep)
    wide = tf.keras.layers.DenseFeatures(wide_columns, name='wide_inputs')(inputs)
    wide = tf.keras.layers.Dense(64, activation='relu')(wide)
    combined = tf.keras.layers.concatenate(inputs=[deep, wide], name='combined')
    output = tf.keras.layers.Dense(1)(combined)
    model = tf.keras.Model(inputs=list(inputs.values()), outputs=output)
    model.compile(optimizer="adam", loss="mape", metrics=["mse", "mae", "mape"])
    return model
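
To see how the two branches fit together, here is a NumPy sketch that traces only the tensor shapes through the same layer sizes. The weights are random, the `dense` helper is introduced here, and the wide input width of 88 is a hypothetical number since it depends on the vocabulary sizes; the deep width of 16 comes from the 13 numerical features plus the 3 embedding dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, units, activation=None):
    """Fully connected layer with random weights, for shape illustration only."""
    weights = rng.normal(size=(x.shape[1], units))
    bias = np.zeros(units)
    out = x @ weights + bias
    return np.maximum(out, 0.0) if activation == "relu" else out

batch = 4
deep_in = rng.normal(size=(batch, 16))                        # 13 numeric + 3 embedding dims
wide_in = rng.integers(0, 2, size=(batch, 88)).astype(float)  # hypothetical one-hot width

# Deep branch: three ReLU layers of 32 units each.
deep = deep_in
for _ in range(3):
    deep = dense(deep, 32, "relu")

# Wide branch: one ReLU layer of 64 units.
wide = dense(wide_in, 64, "relu")

# Concatenate both branches and map to a single predicted price.
combined = np.concatenate([deep, wide], axis=1)
output = dense(combined, 1)

print(deep.shape, wide.shape, combined.shape, output.shape)
# (4, 32) (4, 64) (4, 96) (4, 1)
```

The final layer sees 96 values per house: 32 from the deep branch and 64 from the wide branch.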

In [None]:
model = build_model()

**Plot the Model**

In [None]:
tf.keras.utils.plot_model(model, show_shapes=False, rankdir='LR')

Let's train the model for up to 400 epochs. We add an EarlyStopping callback so that training stops once the model stops improving.

In [None]:
epochs = 400
early_stop = tf.keras.callbacks.EarlyStopping(patience=10)
steps_per_epoch = data_train.shape[0] // batch_size
history = model.fit(
    train_data, 
    steps_per_epoch=steps_per_epoch,
    validation_data=test_data,
    epochs=epochs,
    callbacks=[early_stop],
    verbose=2
)
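
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, assuming the monitored value is the validation loss and any decrease counts as an improvement; `stopping_epoch` and the loss values are made up for this example.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best value: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 3.
losses = [20.0, 18.5, 17.9, 18.0, 18.2, 17.95, 18.1]
print(stopping_epoch(losses, patience=3))   # 6
print(stopping_epoch(losses, patience=10))  # None
```

With a patience of 10, training continues for 10 epochs after the last improvement, so the 400 epochs act as an upper bound rather than a target.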

## Model Evaluation

**Loss (Mean Absolute Percentage Error) over time**

In [None]:
pd.DataFrame(history.history, columns=["loss", "val_loss"]).plot()

**Mean Absolute Error over time**

The Mean Absolute Error of the house prices this model predicts is about 100,000 dollars.

In [None]:
pd.DataFrame(history.history, columns=["mae", "val_mae"]).plot()

**Mean Absolute Percentage Error over time**

In [None]:
pd.DataFrame(history.history, columns=["mape", "val_mape"]).plot()
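
For reference, here is how the three metrics plotted above are defined, written out in plain Python. The prices are made up for illustration and the function names are introduced here.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the target (dollars here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent of the true value."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error, in squared target units."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up prices for illustration only.
actual = [200_000, 400_000, 500_000]
predicted = [220_000, 360_000, 500_000]

print(f"MAE:  {mae(actual, predicted):.2f}")    # MAE:  20000.00
print(f"MAPE: {mape(actual, predicted):.2f}")   # MAPE: 6.67
print(f"MSE:  {mse(actual, predicted):.2f}")    # MSE:  666666666.67
```

MAPE weights every house equally regardless of its price, while MAE and especially MSE are dominated by expensive houses, which is one reason to look at all three.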

## Conclusion 
- The MAPE of this model is about 17%, which means that its predictions are off by about 17% of the actual house price on average.
- The MAE of this model is about 106086.4609, which means that its predictions are off by about 106,086 dollars on average, which is still a significant amount.
- The training and validation curves for MAE, MAPE and MSE are very similar, which suggests that this model does not overfit.
- Based on the correlation scores, the features with the strongest relationship to a house's price are: square footage of the home, overall grade given to the housing unit, square footage of the house apart from the basement, living room area in 2015, number of bathrooms, whether it has been viewed, square footage of the basement, number of bedrooms, latitude coordinate, and whether the house has a view to a waterfront.