# House Price Prediction using TFDF
# Introduction

[TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests)
is a collection of state-of-the-art algorithms for training Decision Forest models
that are compatible with [Keras APIs](https://www.tensorflow.org/api_docs/python/tf/keras).
The models include [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel),
[Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel),
and [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel),
and can be used for regression, classification, and ranking tasks.
For a beginner's guide to TensorFlow Decision Forests,
please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab).
Decision Forests are a family of tree-based models that includes Random Forests and Gradient Boosted Trees. They are a good place to start when working with tabular data, and they will often outperform neural networks, or at least provide a strong baseline before you begin experimenting with them.

In this example we will use TensorFlow Decision Forests to train a Random Forest on a dataset loaded from a CSV file. This is a common pattern in practice. Roughly, your code will look as follows:

```
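import pandas as pd
import tensorflow_decision_forests as tfdf

# Load the CSV into a DataFrame, then convert it to a TensorFlow dataset.
dataset = pd.read_csv("project/dataset.csv")
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
    dataset, label="my_label", task=tfdf.keras.Task.REGRESSION)

# The model defaults to classification, so set the regression task explicitly.
model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)
model.fit(tf_dataset)

# summary() prints the report itself, so wrapping it in print() is unnecessary.
model.summary()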
```

## Setup

### Install TensorFlow Decision Forests

There are many excellent libraries for working with tree-based models, including [scikit-learn](https://scikit-learn.org/) (highly recommended for all your ML needs), XGBoost, LightGBM, and others.

In this example we'll use [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests), a relatively new library used to train large models. 

In [1]:
!pip install tensorflow==2.9.1 -U -qq
!pip install tensorflow_decision_forests==0.2.7 -U -qq

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tfx-bsl 1.9.0 requires pyarrow<6,>=1, but you have pyarrow 8.0.0 which is incompatible.
tensorflow-transform 1.9.0 requires pyarrow<6,>=1, but you have pyarrow 8.0.0 which is incompatible.
tensorflow-io 0.21.0 requires tensorflow<2.7.0,>=2.6.0, but you have tensorflow 2.9.1 which is incompatible.
tensorflow-io 0.21.0 requires tensorflow-io-gcs-filesystem==0.21.0, but you have tensorflow-io-gcs-filesystem 0.27.0 which is incompatible.
flax 0.6.0 requires rich~=11.1, but you have rich 12.1.0 which is incompatible.

### Import the library

In [2]:
# Scientific computing # 
import numpy as np     # Numpy Documentation -  https://numpy.org/doc/stable/ 

# -  Data processing - #
import pandas as pd    # Pandas Documentation - https://pandas.pydata.org/docs/

# ---- Tensorflow ---- #
import tensorflow as tf
import tensorflow_decision_forests as tfdf

2022-09-16 20:46:23.525673: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:
2022-09-16 20:46:23.525768: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [3]:
print("TensorFlow v" + tf.__version__)
print("TensorFlow Decision Forests v" + tfdf.__version__)

TensorFlow v2.9.1
TensorFlow Decision Forests v0.2.7


# Load the dataset
Note: Pandas is practical because you don't have to type the names of the input features to load them. For larger datasets (>1M examples), reading the files with the TensorFlow Dataset API (`tf.data`) may be better suited.

In [4]:
train_file_path = "../input/house-prices-advanced-regression-techniques/train.csv"
train_full_data = pd.read_csv(train_file_path)
print("Full train dataset shape is {}".format(train_full_data.shape))

Full train dataset shape is (1460, 81)


The data is composed of 81 columns and 1460 rows. We can preview the dataset by printing the first 3 rows with the following code (pandas abbreviates the middle columns as `...` in the output):


In [5]:
train_full_data.head(3)

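| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |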



* 79 feature columns.
* Label column named `SalePrice`.
* We will drop the `Id` column because it is not needed for model training.

In [6]:
train_full_data = train_full_data.drop('Id', axis=1)

In [7]:
train_full_data.head(3)

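| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |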


Refer to the [competition page](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data) for a comprehensive guide to the data.

# Exploratory Data Analysis (EDA)
Data scientists use exploratory analysis techniques to analyze and visualize large datasets. This process helps them identify the main characteristics of their data sets and develop effective strategies to get the answers they need. It can also help them spot anomalies and test hypotheses.

For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [Detailed exploratory data analysis with python](https://www.kaggle.com/code/ekami66/detailed-exploratory-data-analysis-with-python) by Tuatini Godard.
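
As a small illustration of the first-pass checks that EDA usually starts with, the sketch below summarizes each column's type, missing-value count, and number of distinct values. The helper `summarize_columns` is a hypothetical name, not part of pandas or TF-DF, and the tiny frame only copies a few columns from the first three rows shown earlier; to inspect the real data you would pass `train_full_data` instead.

```python
import pandas as pd

# Miniature frame built from a few columns of the first three rows of train.csv.
# It is only a stand-in; pass `train_full_data` to inspect the real dataset.
sample = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, 68.0],
    "LotArea": [8450, 9600, 11250],
    "Alley": [None, None, None],      # missing for all three rows
    "MSZoning": ["RL", "RL", "RL"],
    "SalePrice": [208500, 181500, 223500],
})

def summarize_columns(df):
    """Return one row per column: dtype, missing count, distinct non-missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

summary = summarize_columns(sample)
# Columns with the most missing values come first.
print(summary.sort_values("n_missing", ascending=False))
```

Sorting by `n_missing` quickly surfaces columns such as `Alley` that are mostly empty, which is useful to know even though TF-DF handles missing values without preprocessing.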

# Prepare the dataset
This dataset contains a mix of numeric and categorical features, some of which have missing values. TF-DF supports all of these natively, so no preprocessing is required. This is one advantage of tree-based models, and it makes them a great entry point to TensorFlow and ML. We first split the data into training and validation sets.

In [8]:
def split_dataset(dataset, test_ratio=0.10):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, val_ds_pd = split_dataset(train_full_data)
print("{} examples in training, {} examples in validation.".format(
    len(train_ds_pd), len(val_ds_pd)))

1306 examples in training, 154 examples in validation.
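
Because `split_dataset` draws an independent random number for every row and no seed is set, the validation set is only approximately 10% of the data (154 of 1460 rows here) and changes from run to run. The sketch below shows one way to make the split repeatable, assuming a seeded NumPy generator is acceptable; `split_dataset_seeded` and its `seed` parameter are hypothetical names introduced for illustration, and the demo frame is a stand-in with the same number of rows as `train_full_data`.

```python
import numpy as np
import pandas as pd

def split_dataset_seeded(dataset, test_ratio=0.10, seed=42):
    """Randomly split a DataFrame; the same seed always yields the same split."""
    rng = np.random.default_rng(seed)
    # One uniform draw per row; rows whose draw falls below the ratio go to validation.
    test_mask = rng.random(len(dataset)) < test_ratio
    return dataset[~test_mask], dataset[test_mask]

# Hypothetical stand-in for `train_full_data` (same row count, one dummy column).
demo = pd.DataFrame({"x": np.arange(1460)})
train_part, val_part = split_dataset_seeded(demo)
print(len(train_part), len(val_part))
```

A fixed seed makes validation metrics comparable across runs of the notebook; the split sizes still only approximate the requested ratio.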


There's one more step required before you can train your model. You need to convert from Pandas format (`pd.DataFrame`) into TensorFlow format (`tf.data.Dataset`). A single-line helper function will do this for you:

```
tfdf.keras.pd_dataframe_to_tf_dataset(your_df, label='your_label', task=tfdf.keras.Task.REGRESSION)
```

`tf.data` is a [high-performance](https://www.tensorflow.org/guide/data_performance) data loading library that is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). It is not necessary for tree-based models until you begin to do distributed training.

Note that `tf.data` is a bit tricky to use and has a learning curve. There are guides on [tensorflow.org/guide](https://www.tensorflow.org/guide) to help.

In [9]:
label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    train_ds_pd, 
    label = label, 
    task = tfdf.keras.Task.REGRESSION)

val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    val_ds_pd, 
    label = label, 
    task = tfdf.keras.Task.REGRESSION)

  features_dataframe = dataframe.drop(label, 1)
2022-09-16 20:46:29.061044: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:
2022-09-16 20:46:29.061113: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-09-16 20:46:29.061156: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (3afcafd27a4a): /proc/driver/nvidia/version does not exist
2022-09-16 20:46:29.061563: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them 

# Create and train a Random Forest 

In [10]:
model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)

Use /tmp/tmp45m6f4hw as temporary training directory


In [11]:
model.compile(metrics=["mse"])
model.fit(x=train_ds)

Reading training dataset...
Training dataset read in 0:00:06.966001. Found 1306 examples.
Training model...


[INFO kernel.cc:1176] Loading model from path /tmp/tmp45m6f4hw/model/ with prefix 4f7c211671c1484c


Model trained in 0:00:02.976722
Compiling model...


[INFO abstract_model.cc:1248] Engine "RandomForestOptPred" built
[INFO kernel.cc:1022] Use fast generic engine


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code
Model compiled.


<keras.callbacks.History at 0x7f3fd7cbbf50>

# Visualize your model
One benefit of tree-based models is that you can easily visualize them. The default number of trees used in the Random Forest is 300. You can select a tree to display below.

In [12]:
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=3)

# Evaluate the model

In [13]:
import math  # standard-library module for math functions such as sqrt
                
evaluation = model.evaluate(val_ds, return_dict=True)

print(evaluation)
print(f"MSE: {evaluation['mse']:.2f}")
print(f"RMSE: {math.sqrt(evaluation['mse']):.2f}")

{'loss': 0.0, 'mse': 993684800.0}
MSE: 993684800.00
RMSE: 31522.77
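
RMSE is the square root of MSE, so it is expressed in the same units as the label (here, sale price), which makes it easier to interpret than the raw MSE. The sketch below reproduces that arithmetic and adds a second, hypothetical helper, `rmse_of_logs`, that measures error on the logarithm of prices. The competition's evaluation page describes scoring on the RMSE between the logarithms of predicted and observed prices; that detail is not part of this notebook, so verify it there. The prices passed to `rmse_of_logs` are made up purely to show the call.

```python
import math

def rmse_from_mse(mse):
    """RMSE is the square root of MSE, in the same units as the label."""
    return math.sqrt(mse)

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(prediction) and log(observed); prices must be positive."""
    squared_errors = [
        (math.log(pred) - math.log(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# MSE value taken from the evaluation output above.
print(f"{rmse_from_mse(993684800.0):.2f}")

# Hypothetical prices, only to demonstrate the log-scale metric.
print(rmse_of_logs([200000.0, 150000.0], [210000.0, 140000.0]))
```

The first line printed matches the RMSE of 31522.77 reported above. Working on the log scale penalizes relative rather than absolute errors, so a given percentage miss counts the same for cheap and expensive houses.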


# Test Set Prediction

In [14]:
test_file_path = "../input/house-prices-advanced-regression-techniques/test.csv"
test_data = pd.read_csv(test_file_path)
ids = test_data.pop('Id')

In [15]:
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    test_data, 
    task = tfdf.keras.Task.REGRESSION)

In [16]:
preds = model.predict(test_ds)
output = pd.DataFrame({'Id': ids,
                       'SalePrice': preds.squeeze()})

output.head()



| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 126086.6875 |
| 1 | 1462 | 156190.3125 |
| 2 | 1463 | 180349.140625 |
| 3 | 1464 | 183884.25 |
| 4 | 1465 | 194836.25 |


In [17]:
sample_submission_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
sample_submission_df['SalePrice'] = model.predict(test_ds).squeeze()  # flatten (n, 1) predictions to 1-D
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)
sample_submission_df.head()



| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 126086.6875 |
| 1 | 1462 | 156190.3125 |
| 2 | 1463 | 180349.140625 |
| 3 | 1464 | 183884.25 |
| 4 | 1465 | 194836.25 |
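
Before uploading, it can help to sanity-check the submission frame. The sketch below is one possible set of checks, not a requirement of the competition: `check_submission` is a hypothetical helper, and the demo frame reuses the first three rows of the output above. With the real data you would call it as `check_submission(sample_submission_df, ids)`.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Raise ValueError if the submission frame looks malformed."""
    if list(submission.columns) != ["Id", "SalePrice"]:
        raise ValueError(f"unexpected columns: {list(submission.columns)}")
    if list(submission["Id"]) != list(expected_ids):
        raise ValueError("Id column does not match the test set ids")
    if submission["SalePrice"].isna().any():
        raise ValueError("SalePrice contains missing values")
    if (submission["SalePrice"] <= 0).any():
        raise ValueError("SalePrice contains non-positive values")

# First three rows of the predictions shown above, used as a small demo.
demo_ids = [1461, 1462, 1463]
demo_submission = pd.DataFrame({
    "Id": demo_ids,
    "SalePrice": [126086.6875, 156190.3125, 180349.140625],
})

check_submission(demo_submission, demo_ids)
print("submission looks well-formed")
```

Failing fast with a `ValueError` catches problems such as misaligned ids or missing predictions before the file is written.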


# References
* Dive deeper into:
    * [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel)
    * [Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel)
    * [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel)
    * [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras)
    * [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests)
* [Detailed exploratory data analysis with python](https://www.kaggle.com/code/ekami66/detailed-exploratory-data-analysis-with-python) by Tuatini Godard.
* The TensorFlow Decision Forests tutorials, a set of three tutorials:
    * [Beginner Tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)
    * [Intermediate Tutorial](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab)
    * [Advanced Tutorial](https://www.tensorflow.org/decision_forests/tutorials/advanced_colab)
* The [TensorFlow Forum](https://discuss.tensorflow.org/), where you can get in touch with the TensorFlow community. Check it out if you haven't yet.