# House Price Prediction using TFDF

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/decision_forests/tutorials/kaggle_beginner_example_regression"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/vanshhhhh/Google-Summer-of-Code-2022-TensorFlow/blob/main/src/kaggle_beginner_example_regression.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/vanshhhhh/Google-Summer-of-Code-2022-TensorFlow/blob/main/src/kaggle_beginner_example_regression.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/decision-forests/documentation/tutorials/kaggle_beginner_example_regression.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

Kaggle dataset - [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

## Introduction

[TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests)
is a collection of state-of-the-art algorithms for Decision Forest models
that are compatible with [Keras APIs](https://www.tensorflow.org/api_docs/python/tf/keras).
The models include [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel),
[Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel),
and [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel),
and can be used for regression, classification, and ranking tasks.
For a beginner's guide to TensorFlow Decision Forests,
please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab).

### Random Forest
Decision Forests are a family of tree-based models that includes Random Forests and Gradient Boosted Trees. They are a good place to start when working with tabular data, and they will often outperform neural networks, or at least provide a strong baseline before you begin experimenting with them.

In this example we will use TensorFlow Decision Forests to train a Random Forest on a dataset you load from a CSV file. This is a common pattern in practice. Roughly, your code will look as follows:

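```
import tensorflow_decision_forests as tfdf
import pandas as pd

# Load the CSV file into a Pandas DataFrame.
dataset = pd.read_csv("project/dataset.csv")
# Convert the DataFrame into a tf.data.Dataset for a regression task.
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="my_label", task=tfdf.keras.Task.REGRESSION)

# The model needs the task as well; its default is classification.
model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)
model.fit(tf_dataset)

# summary() prints the report itself, so there is no need to wrap it in print().
model.summary()
```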

### Setup

#### Install TensorFlow Decision Forests

There are many excellent libraries for working with tree-based models, including [scikit-learn](https://scikit-learn.org/) (highly recommended for all your ML needs), XGBoost, LightGBM, and others.

In this example we'll use [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests), a relatively new library for training, evaluating, and serving decision forest models with TensorFlow.

In [None]:
!pip install tensorflow_decision_forests --quiet

#### Import the library

In [None]:
# Scientific computing # 
import numpy as np     # Numpy Documentation -  https://numpy.org/doc/stable/ 

# -  Data processing - #
import pandas as pd    # Pandas Documentation - https://pandas.pydata.org/docs/

# -- Hide Warnings  -- #
import warnings
warnings.filterwarnings('ignore')

# ---- Tensorflow ---- #
import tensorflow as tf
import tensorflow_decision_forests as tfdf

In [None]:
print("TensorFlow v" + tf.__version__)
print("TensorFlow Decision Forests v" + tfdf.__version__)

TensorFlow v2.9.1
TensorFlow Decision Forests v0.2.7


### Download the House Prices dataset
The [House Prices dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) is an example of a supervised regression problem: the goal is to predict the sale price of a home from its features.

To run this notebook, you need to have a Kaggle account.

If you do not have an account, you can create one here: [Kaggle Register](https://www.kaggle.com/account/login?phase=startRegisterTab&returnUrl=%2F) 

To get a token to use in the following cell, check out the [Authentication Section](https://www.kaggle.com/docs/api#authentication) of the Kaggle API documentation.

In [None]:
#@title Enter your Kaggle token in order to fetch the dataset

username = '' #@param {type:"string"}
key = '' #@param {type: "string"}

In [None]:
#@title Configure Kaggle
try:
  from google.colab import files, drive

  # Install and Configure Kaggle
  import json

  token = {
    "username":username,
    "key":key
  }

  # Installing kaggle
  !pip install kaggle &> /dev/null

  # Creating .kaggle if necessary
  !if [ -d .kaggle ]; then echo ".kaggle exists"; else echo ".kaggle does not exist ... Creating it"; mkdir .kaggle; if [ -d .kaggle ]; then echo "Successfully created"; else echo "Error creating .kaggle"; fi; fi

  with open('/content/.kaggle/kaggle.json', 'w') as file:
      json.dump(token, file)

  # Creating .kaggle if necessary
  !if [ -d  ~/.kaggle ]; then echo " ~/.kaggle exists"; else echo " ~/.kaggle does not exist ... Creating it"; mkdir  ~/.kaggle; if [ -d  ~/.kaggle ]; then echo "Successfully created"; else echo "Error creating  ~/.kaggle"; fi; fi
  !cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json

  # kaggle configuration
  !kaggle config set -n path -v{/content}

  # Changing mode
  !chmod 600 /root/.kaggle/kaggle.json
except Exception:
  pass

In [None]:
#@title Download Dataset
import os

DOWNLOAD_LOCATION = "/root/Downloads/"

if os.path.exists(DOWNLOAD_LOCATION):
    if os.path.isdir(DOWNLOAD_LOCATION):
        print("{} exists and is a directory".format(DOWNLOAD_LOCATION))
    else:
        print("{} exists but is not a directory!!!".format(DOWNLOAD_LOCATION))
else:
    print("{} does not exist ... Creating it".format(DOWNLOAD_LOCATION))
    os.makedirs(DOWNLOAD_LOCATION)

# Downloading
!kaggle competitions download -c house-prices-advanced-regression-techniques -p {DOWNLOAD_LOCATION}

# Extracting archives
!cd {DOWNLOAD_LOCATION}; unzip -qq \*.zip; rm -f *.zip

## Data Loading
Note: Pandas is practical because you don't have to type in the names of the input features to load them. For larger datasets (>1M examples), reading the files with a TensorFlow Dataset may be better suited.

In [None]:
train_file_path = os.path.join(DOWNLOAD_LOCATION, "train.csv")
train_full_data = pd.read_csv(train_file_path)
print("Full train dataset shape is {}".format(train_full_data.shape))

Full train dataset shape is (1460, 81)


The data is composed of 81 columns and 1460 entries. We can preview the dataset by printing the first 3 entries with the following code. Because the table is wide, Pandas abbreviates the middle columns with `...`:


In [None]:
train_full_data.head(3)

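|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |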



* 79 feature columns.
* Label column named `SalePrice`.
* We will drop the `Id` column as it is not necessary for model training.

In [None]:
train_full_data = train_full_data.drop('Id', axis=1)

Let's print the updated table.

In [None]:
train_full_data.head(3)

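|   | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | Inside | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | FR2 | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | Inside | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |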


To learn more about the data, refer to the data description on [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data).

## Prepare the dataset
This dataset contains a mix of numeric and categorical features, some of which have missing values. TF-DF supports all of these natively, so no preprocessing is required. This is one advantage of tree-based models, and it makes them a great entry point to TensorFlow and ML.
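To see what "a mix of feature types" looks like in practice, the following sketch counts numeric and non-numeric feature columns and the missing values per column. It runs on a tiny hypothetical table (`toy`, with illustrative values rather than the real data), and `summarize_features` is a helper name introduced here, not part of TF-DF or Pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the House Prices table (illustrative values only).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "Street": ["Pave", "Pave", "Pave", "Grvl"],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

def summarize_features(df, label):
    """Return numeric feature names, non-numeric feature names, and
    a dict of missing-value counts for columns that have any."""
    features = df.drop(columns=[label])
    numeric = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    missing = features.isna().sum()
    return numeric, categorical, missing[missing > 0].to_dict()

numeric_cols, categorical_cols, missing_counts = summarize_features(toy, "SalePrice")
print("Numeric features:", numeric_cols)
print("Categorical features:", categorical_cols)
print("Missing values per column:", missing_counts)
```

Running the same helper on `train_full_data` with `label="SalePrice"` should give a quick overview of the real dataset before training.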

Let's split the dataset into training and validation sets:

In [None]:
def split_dataset(dataset, test_ratio=0.10):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, val_ds_pd = split_dataset(train_full_data)
print("{} examples in training, {} examples in validation.".format(
    len(train_ds_pd), len(val_ds_pd)))

1308 examples in training, 152 examples in validation.
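Because the split draws fresh random numbers on every run, the exact counts above will vary, and each row is assigned independently, so the validation share is only approximately `test_ratio`. If you want a repeatable split, one option is to seed the random generator. The sketch below is a variation on the function above; `split_dataset_seeded` and its `seed` parameter are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def split_dataset_seeded(dataset, test_ratio=0.10, seed=42):
    """Randomly split a DataFrame; the same seed always gives the same split."""
    rng = np.random.default_rng(seed)
    # Each row lands in the validation set with probability test_ratio.
    test_mask = rng.random(len(dataset)) < test_ratio
    return dataset[~test_mask], dataset[test_mask]

# Hypothetical stand-in for train_full_data.
toy = pd.DataFrame({"x": range(1000)})

train_a, val_a = split_dataset_seeded(toy)
train_b, val_b = split_dataset_seeded(toy)

print("{} examples in training, {} examples in validation.".format(
    len(train_a), len(val_a)))
print("Same split on both calls:", train_a.index.equals(train_b.index))
```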


There's one more step required before you can train your model. You need to convert from Pandas format (`pd.DataFrame`) into TensorFlow format (`tf.data.Dataset`). A single-line helper function will do this for you:

```
tfdf.keras.pd_dataframe_to_tf_dataset(your_df, label='your_label', task=tfdf.keras.Task.REGRESSION)
```

`tf.data` is a high-[performance](https://www.tensorflow.org/guide/data_performance) data loading library which is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). A GPU (Graphics Processing Unit) is a specialized processor with dedicated memory that conventionally performs the floating point operations required for rendering graphics. GPUs are well suited to training artificial intelligence and deep learning models because they can process many computations simultaneously. For tree-based models, such an input pipeline is not necessary until you begin to do distributed training.

Creating a fast input pipeline is important when working with neural networks, and forgetting to do so is the most common bug new researchers encounter. The author of this notebook has seen many folks with expensive GPUs that are idle ~50% of the time while waiting for data.

Note that tf.data is a bit tricky to use, and has a learning curve. There are guides on [tensorflow.org/guide](https://www.tensorflow.org/guide) to help.

In [None]:
label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
           train_ds_pd, 
           label = label, 
           task = tfdf.keras.Task.REGRESSION)

val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
         val_ds_pd, 
         label = label, 
         task = tfdf.keras.Task.REGRESSION)

## Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. 

For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [Detailed exploratory data analysis with python](https://www.kaggle.com/code/ekami66/detailed-exploratory-data-analysis-with-python) by Tuatini Godard.

## Create a Random Forest 

In [None]:
model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)

## Train your model

This is a one-liner.

Note: You can safely ignore the warning about Autograph.

In [None]:
model.fit(x=train_ds)

## Visualize your model
One benefit of tree-based models is that you can easily visualize them. The default number of trees used in the Random Forest is 300. You can select a tree to display below.

In [None]:
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=3)
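To give some intuition for why a forest holds many trees, here is a deliberately simplified sketch of the underlying idea: each tree is fit on a bootstrap sample of the data, and the forest's regression prediction is the average of the trees' predictions. This is not TF-DF's implementation. The trees here are one-split "stumps" on a single feature, whereas real Random Forests grow deeper trees and also sample features at each split. All names and the `living_area`/`price` values are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on one numeric feature.

    Returns (threshold, left_value, right_value): the stump predicts
    left_value when x <= threshold and right_value otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not right:
            continue  # a split must leave examples on both sides
        left_mean, right_mean = mean(left), mean(right)
        # Sum of squared errors when each side predicts its own mean.
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Every x in the sample is identical: always predict the mean.
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]

def fit_forest(xs, ys, num_trees=25, seed=0):
    """Fit num_trees stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(num_trees):
        # Bootstrap: draw n examples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict(forest, x):
    """Average the predictions of all stumps in the forest."""
    return mean([left if x <= threshold else right
                 for threshold, left, right in forest])

# Hypothetical toy data: living area vs. price (arbitrary units).
living_area = [50.0, 60.0, 80.0, 100.0, 120.0, 150.0]
price = [100.0, 120.0, 150.0, 200.0, 230.0, 300.0]

forest = fit_forest(living_area, price)
print("Prediction for area 55:", predict(forest, 55.0))
print("Prediction for area 140:", predict(forest, 140.0))
```

Averaging many trees trained on different resamples smooths out the quirks of any single tree, which is why inspecting one tree with the plotter shows only part of what the model does.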

## Evaluate the model

In [None]:
import math     #built-in module that you can use for 
                #mathematical tasks (eg. square root)
                
evaluation = model.evaluate(val_ds, return_dict=True)

print(evaluation)
print(f"MSE: {evaluation['mse']:.2f}")
print(f"RMSE: {math.sqrt(evaluation['mse']):.2f}")

{'loss': 0.0, 'mse': 784483712.0}
MSE: 784483712.00
RMSE: 28008.64
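To make the relationship between the two printed metrics concrete, here is a minimal sketch that computes MSE and RMSE by hand. The `actual` and `predicted` prices are made-up values for illustration, not model output; the final lines only re-derive the RMSE from the MSE value reported above.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical sale prices and predictions (in dollars).
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]

mse = mean_squared_error(actual, predicted)
rmse = math.sqrt(mse)  # RMSE is in the same unit as the label (dollars)

print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")

# Re-derive the RMSE from the MSE reported by model.evaluate() above.
reported_mse = 784483712.0
reported_rmse = math.sqrt(reported_mse)
print(f"RMSE from reported MSE: {reported_rmse:.2f}")
```

Because MSE is expressed in squared dollars, RMSE is usually the easier number to interpret: it is a typical prediction error in dollars.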


# Test Set Prediction
Now we will generate predictions for `test.csv`.


In [None]:
test_file_path = os.path.join(DOWNLOAD_LOCATION, "test.csv")
test_data = pd.read_csv(test_file_path)
ids = test_data.pop('Id')

In [None]:
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
          test_data, 
          task = tfdf.keras.Task.REGRESSION)

In [None]:
preds = model.predict(test_ds)
output = pd.DataFrame({'Id': ids,
                       'SalePrice': preds.squeeze()})

output.head()



|   | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 125631.03125 |
| 1 | 1462 | 153911.578125 |
| 2 | 1463 | 179735.84375 |
| 3 | 1464 | 185844.453125 |
| 4 | 1465 | 196257.640625 |


You can download the predicted output as a CSV file and submit it on the [Competition page](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/submit) on Kaggle.

In [None]:
output_filename = "test_prediction_output.csv"
output.to_csv(output_filename, index=False)
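Before uploading, it can be worth sanity-checking the file. The sketch below assumes the submission should have exactly the two columns used above (`Id` and `SalePrice`), unique ids, and positive, non-missing prices. `check_submission` is a hypothetical helper, and the rules it applies are assumptions rather than an official Kaggle validator.

```python
import pandas as pd

def check_submission(df, expected_rows=None):
    """Return a list of problems found in a submission DataFrame."""
    problems = []
    if list(df.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems  # the remaining checks need both columns
    if df["Id"].duplicated().any():
        problems.append("duplicate Id values")
    if df["SalePrice"].isna().any():
        problems.append("missing SalePrice values")
    if (df["SalePrice"] <= 0).any():
        problems.append("non-positive SalePrice values")
    if expected_rows is not None and len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    return problems

# Hypothetical examples: one well-formed frame and one with two problems.
good = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [125631.0, 153911.5]})
bad = pd.DataFrame({"Id": [1461, 1461], "SalePrice": [125631.0, float("nan")]})

print("good:", check_submission(good, expected_rows=2))
print("bad:", check_submission(bad))
```

In the notebook you could call `check_submission(output, expected_rows=len(test_data))` right before downloading the file.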

In [None]:
from google.colab import files
files.download('test_prediction_output.csv')

# Try it out yourself
We've provided a bunch of code which you can use to explore the dataset, in case this is helpful to you in your future work. The code you need to write for this exercise is only a couple of lines.

Note: For this section the `test_ratio` is increased from 0.1 to 0.3, so you may get different results.


## Explore the dataset

In [None]:
train_file_path = os.path.join(DOWNLOAD_LOCATION, "train.csv")
train_full_data = pd.read_csv(train_file_path)
print("Full train dataset shape is {}".format(train_full_data.shape))

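label = "SalePrice"
# SalePrice is a continuous regression target, so keep its values as they are.
# Mapping them to class indices only makes sense for classification.
print(train_full_data[label].describe())

# Drop the Id column, as it is not necessary for model training.
train_full_data = train_full_data.drop('Id', axis=1)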

### Split the dataset

In [None]:
def split_dataset(dataset, test_ratio=0.30):
    # YOUR CODE HERE


    # Add code to split the dataset
    return # your split data set

train_ds_pd, val_ds_pd = split_dataset(train_full_data)
print("{} examples in training, {} examples in validation.".format(
   len(train_ds_pd), len(val_ds_pd)))

In [None]:
#@title Solution
'''def split_dataset(dataset, test_ratio=0.30):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]'''

## Create tf.data.Datasets from the Pandas DataFrame

In [None]:
# YOUR CODE HERE


# Add code to create a tf.data.Dataset for train and validation from the DataFrames
# Example...
# train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(...
# val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(...

In [None]:
#@title Solution
#train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
#           train_ds_pd, 
#           label = label, 
#           task = tfdf.keras.Task.REGRESSION)

#val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
#         val_ds_pd, 
#         label = label, 
#         task = tfdf.keras.Task.REGRESSION)

## Create your model

In [None]:
# YOUR CODE HERE


# Add code to create a random forest
# Example ...
# mymodel = tfdf.keras. ...

In [None]:
#@title Solution
#mymodel = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)

## Train your Model

In [None]:
# YOUR CODE HERE


# Add code to train your model
# Example ...
# mymodel.fit(...

In [None]:
#@title Solution
#mymodel.fit(x=train_ds)

## Evaluate your model
Uncomment these cells after completing the code above.

In [None]:
#evaluation = mymodel.evaluate(val_ds, return_dict=True)
#print(evaluation)

In [None]:
#print(f"MSE: {evaluation['mse']:.2f}")
#print(f"RMSE: {math.sqrt(evaluation['mse']):.2f}")

# References
* Dive deep into
    * [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel)
    * [Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel)
    * [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel)
    * [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras)
    * [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests)
* [Detailed exploratory data analysis with python](https://www.kaggle.com/code/ekami66/detailed-exploratory-data-analysis-with-python) by Tuatini Godard.
* The TensorFlow Decision Forests tutorials, a set of three tutorials:
    * [Beginner Tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)
    * [Intermediate Tutorial](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab)
    * [Advanced Tutorial](https://www.tensorflow.org/decision_forests/tutorials/advanced_colab)
* The [TensorFlow Forum](https://discuss.tensorflow.org/), where you can get in touch with the TensorFlow community. Check it out if you haven't yet.