# Predicting House Prices

---

This notebook is part of the [CaTabRa GitHub repository](https://github.com/risc-mi/catabra).

This short example demonstrates how to create a model for predicting house prices with CaTabRa:

* [prepare data](#Prepare-Data),
* [train a regression model](#Analyze-Data-and-Train-Model),
* [evaluate the model](#Evaluate-Model), and
* [explain the model](#Explain-Model).

Familiarity with CaTabRa's main data analysis workflow is assumed. A step-by-step introduction can be found in [Workflow.ipynb](https://github.com/risc-mi/catabra/tree/main/examples/Workflow.ipynb).

## Prerequisites

In [1]:
from catabra.util import io

In [2]:
# output directory (where all generated artifacts, like statistics, models, etc. are saved)
output_dir = 'house_sales'

## Prepare Data

In [3]:
# load dataset
from sklearn.datasets import fetch_openml
X, y = fetch_openml(data_id=44066, return_X_y=True, as_frame=True)

In [4]:
# add target labels to DataFrame
X['price'] = y

In [5]:
# split into training and test set by adding a column with the corresponding values
# the column name is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and its values
X['train'] = X['date_year'] == '0'    # temporal split
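
Before training, it can be worth checking how many rows end up in each part of the split. The sketch below uses a tiny stand-in table with hypothetical values (the name `toy` and its contents are illustrative, not the real dataset) and assumes, as in the cell above, that `date_year` holds the string labels `'0'` and `'1'`.

```python
import pandas as pd

# Tiny stand-in for the real table (hypothetical values).
toy = pd.DataFrame({
    'date_year': ['0', '0', '1', '0', '1'],
    'price': [12.31, 13.20, 12.10, 13.31, 13.14],
})

# Same rule as above: rows from year '0' form the training set, the rest the test set.
toy['train'] = toy['date_year'] == '0'

# Count the rows in each part of the split.
n_train = int(toy['train'].sum())
n_test = int((~toy['train']).sum())
print(f'train: {n_train}, test: {n_test}')
```

If the column were numeric rather than string-valued, comparing against `'0'` would mark every row as test data, so a quick count like this catches that mistake early.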

In [6]:
X.head()

| | bedrooms | bathrooms | sqft_living | sqft_lot | waterfront | grade | sqft_above | sqft_basement | yr_built | yr_renovated | lat | long | sqft_living15 | sqft_lot15 | date_year | date_month | date_day | price | train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 1.0 | 1180.0 | 5650.0 | 0 | 7.0 | 1180.0 | 0.0 | 1955.0 | 0.0 | 47.5112 | -122.257 | 1340.0 | 5650.0 | 0 | 10.0 | 13.0 | 12.309987 | True |
| 1 | 3.0 | 2.25 | 2570.0 | 7242.0 | 0 | 7.0 | 2170.0 | 400.0 | 1951.0 | 1991.0 | 47.721 | -122.319 | 1690.0 | 7639.0 | 0 | 12.0 | 9.0 | 13.195616 | True |
| 2 | 2.0 | 1.0 | 770.0 | 10000.0 | 0 | 6.0 | 770.0 | 0.0 | 1933.0 | 0.0 | 47.7379 | -122.233 | 2720.0 | 8062.0 | 1 | 2.0 | 25.0 | 12.100718 | False |
| 3 | 4.0 | 3.0 | 1960.0 | 5000.0 | 0 | 7.0 | 1050.0 | 910.0 | 1965.0 | 0.0 | 47.5208 | -122.393 | 1360.0 | 5000.0 | 0 | 12.0 | 9.0 | 13.311331 | True |
| 4 | 3.0 | 2.0 | 1680.0 | 8080.0 | 0 | 8.0 | 1680.0 | 0.0 | 1987.0 | 0.0 | 47.6168 | -122.045 | 1800.0 | 7503.0 | 1 | 2.0 | 18.0 | 13.142168 | False |


## Analyze Data and Train Model

In [7]:
from catabra.analysis import analyze

analyze(
    X,                        # table to analyze; can also be the path to a CSV/Excel/HDF5 file
    regress='price',          # name of column containing regression target
    split='train',            # name of column containing information about the train-test split (optional)
    time=3,                   # time budget for hyperparameter tuning, in minutes (optional)
    jobs=2,                   # number of parallel jobs
    out=output_dir
)

[CaTabRa] ### Analysis started at 2023-04-19 14:53:07.415165
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend auto-sklearn for regression
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb
[CaTabRa] Using auto-sklearn 1.0 (regression not supported by 2.0).
[CaTabRa] New model #1 trained:
    val_r2: 0.891634
    val_mean_absolute_error: 0.123526
    val_mean_squared_error: 0.030498
    train_r2: 0.984437
    type: random_forest
    total_elapsed_time: 00:30
[CaTabRa] New model #2 trained:
    val_r2: 0.904801
    val_mean_absolute_error: 0.116970
    val_mean_squared_error: 0.026792
    train_r2: 0.976546
    type: gradient_boosting
    total_elapsed_time: 00:35
[CaTabRa] New model #3 trained:
    val_r2: 0.902060
    val_mean_absolute_error: 0.119535
    val_mean_squared_error: 0.027564
    train_r2: 0.981697
    type: gradient_boosting
    total_elapsed_time: 00:37
[CaTabRa] New model #4 trained:
    val_r2: 0.897691
    val_mean_absolute_error: 0.122485
    val_mean_squared_error: 0.028793
    train_r2: 0.994704
    type: gradient_boosting
    total_elapsed_time: 00:48
[CaTabRa] New model #5 trained:
    val_r2: 0.768186
    val_mean_absolute_error: 0.187598
    val_mean_squared_error: 0.065241
    train_r2: 1.000000
    type: k_neare
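
The log reports both validation and training scores for each candidate model, and the gap between the two hints at overfitting. As a minimal sketch, the scores printed above can be copied into a list and ranked; the variable names are illustrative, and the type of model #5 is kept exactly as it appears in the truncated log.

```python
# Scores copied from the training log above: (model id, type, val_r2, train_r2).
# The type of model #5 is truncated in the log and is left as printed.
models = [
    (1, 'random_forest', 0.891634, 0.984437),
    (2, 'gradient_boosting', 0.904801, 0.976546),
    (3, 'gradient_boosting', 0.902060, 0.981697),
    (4, 'gradient_boosting', 0.897691, 0.994704),
    (5, 'k_neare', 0.768186, 1.000000),
]

# Rank by validation R^2 and report the train/validation gap for each model.
ranked = sorted(models, key=lambda m: m[2], reverse=True)
for model_id, model_type, val_r2, train_r2 in ranked:
    gap = train_r2 - val_r2
    print(f'#{model_id} {model_type:<18} val_r2={val_r2:.3f} gap={gap:.3f}')

# Model with the best validation score, and the one with the largest gap.
best_id = ranked[0][0]
largest_gap_id = max(models, key=lambda m: m[3] - m[2])[0]
```

Among these five candidates, model #2 has the highest validation score, while model #5 fits the training data perfectly yet validates worst, which is the usual signature of overfitting.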

## Evaluate Model

The model was automatically evaluated after training, because we specified a train-test split. We can inspect the results:

In [8]:
metrics = io.read_df(output_dir + '/eval/not_train/metrics.xlsx')

In [9]:
metrics

| | Unnamed: 0 | n | r2 | mean_absolute_error | mean_squared_error | root_mean_squared_error | mean_squared_log_error | median_absolute_error | mean_absolute_percentage_error | max_error | explained_variance | mean_poisson_deviance | mean_gamma_deviance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | price | 6980 | 0.845543 | 0.151756 | 0.042936 | 0.207209 | 0.000215 | 0.110974 | 0.011558 | 1.19069 | 0.857237 | 0.003273 | 0.00025 |
| 1 | \_\_overall\_\_ | 6980 | 0.845543 | 0.151756 | 0.042936 | 0.207209 | 0.000215 | 0.110974 | 0.011558 | 1.19069 | 0.857237 | 0.003273 | 0.00025 |


Also check out `/eval/not_train/static_plots/price.pdf` (relative to the output directory), which shows a scatter plot of ground-truth vs. predicted house prices.
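
To make the main columns of the metrics table concrete, the sketch below computes R², MAE, MSE and RMSE by hand on a few hypothetical values (not the model's actual predictions). It also illustrates one way to read the reported MAE: the `price` values in the preview (roughly 12 to 13) suggest that the target is log-transformed. Assuming a natural logarithm, an absolute error in log space corresponds to a multiplicative error on the original price scale.

```python
import numpy as np

# Hypothetical ground-truth and predicted log-prices (not taken from the model above).
y_true = np.array([12.3, 13.2, 12.1, 13.3, 13.1])
y_pred = np.array([12.4, 13.0, 12.2, 13.3, 13.4])

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))   # mean_absolute_error
mse = np.mean(residuals ** 2)      # mean_squared_error
rmse = np.sqrt(mse)                # root_mean_squared_error
# r2: one minus the ratio of residual to total sum of squares
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f'r2={r2:.3f}  mae={mae:.3f}  mse={mse:.3f}  rmse={rmse:.3f}')

# Assumption: prices are natural-log transformed. An absolute error e in log
# space then corresponds to a multiplicative error of exp(e) on the price scale.
reported_mae = 0.151756  # test-set MAE from the metrics table
factor = np.exp(reported_mae)
print(f'typical multiplicative error: x{factor:.3f}')
```

Under that assumption, the reported MAE would mean that predictions are typically off by a factor of roughly 1.16 in either direction on the original price scale.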

## Explain Model

In [10]:
from catabra.explanation import explain

explain(
    X,
    folder=output_dir,       # directory containing trained model (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/explain',
    explainer='permutation'  # permutation importance; omit this argument to use SHAP instead, which takes very long in this case
)

[CaTabRa] ### Explanation started at 2023-04-19 14:57:20.962098
[CaTabRa] *** Split train
Features: 100%|########################################| 17/17 [00:33<00:00, 1.94s/it]  
[CaTabRa] *** Split not_train
Features: 100%|########################################| 17/17 [00:13<00:00, 1.28it/s]  
[CaTabRa] ### Explanation finished at 2023-04-19 14:58:23.904994
[CaTabRa] ### Elapsed time: 0 days 00:01:02.942896
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales/explain


In [11]:
importance = io.read_df(output_dir + '/explain/not_train/__ensemble__.h5')

In [12]:
importance.sort_values('r2', ascending=False)

| | r2 | mean_absolute_error | mean_squared_error | r2 std | mean_absolute_error std | mean_squared_error std |
|---|---|---|---|---|---|---|
| lat | 0.550714 | 0.199182 | 0.153086 | 0.005703 | 0.001403 | 0.001585 |
| sqft_living | 0.201067 | 0.082459 | 0.055892 | 0.00328 | 0.00134 | 0.000912 |
| grade | 0.150816 | 0.060628 | 0.041924 | 0.004245 | 0.001382 | 0.00118 |
| long | 0.075237 | 0.034909 | 0.020914 | 0.00094 | 0.000536 | 0.000261 |
| sqft_living15 | 0.025162 | 0.011922 | 0.006994 | 0.00185 | 0.000653 | 0.000514 |
| sqft_lot | 0.020695 | 0.012332 | 0.005753 | 0.001152 | 0.000607 | 0.00032 |
| waterfront | 0.01339 | 0.003592 | 0.003722 | 0.000506 | 0.000205 | 0.000141 |
| bathrooms | 0.010384 | 0.005163 | 0.002887 | 0.000498 | 0.000217 | 0.000139 |
| yr_built | 0.008597 | 0.004961 | 0.00239 | 0.000567 | 0.000129 | 0.000158 |
| sqft_lot15 | 0.005792 | 0.003533 | 0.00161 | 0.000316 | 0.0002 | 8.8e-05 |


Also check out `/explain/not_train/static_plots/` (relative to the output directory) for visualizations of the permutation importance.
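
For readers unfamiliar with permutation importance, the general idea can be sketched in a few lines of NumPy: shuffle one feature column at a time, re-score the model, and record how much the metric degrades. This is a generic illustration on synthetic data with a least-squares stand-in model, not CaTabRa's implementation; all names (`X_toy`, `y_toy`, `predict`, `r2_score`) are hypothetical. The importance table above presumably reports analogous per-metric changes together with their standard deviations over repeated shuffles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the target depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2.
n = 500
X_toy = rng.normal(size=(n, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=n)

# Stand-in model: ordinary least squares with an intercept term.
A = np.column_stack([X_toy, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

def predict(features):
    return np.column_stack([features, np.ones(len(features))]) @ coef

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

baseline = r2_score(y_toy, predict(X_toy))

n_repeats = 5
importance_mean, importance_std = [], []
for j in range(X_toy.shape[1]):
    drops = []
    for _ in range(n_repeats):
        X_perm = X_toy.copy()
        # Shuffle one column to break its relationship with the target.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - r2_score(y_toy, predict(X_perm)))
    importance_mean.append(np.mean(drops))
    importance_std.append(np.std(drops))

for j, (m, s) in enumerate(zip(importance_mean, importance_std)):
    print(f'feature {j}: r2 drop = {m:.3f} +/- {s:.3f}')
```

In this toy setup the drop in R² is largest for feature 0 and close to zero for the irrelevant feature 2. For brevity the sketch scores the model on the data it was fitted on; in practice, held-out data such as the `not_train` split used above is preferable.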