# Report

This notebook reports progress on the house price prediction project.

In [1]:
import os
from functools import partial

In [2]:
import joblib
import pandas as pd

In [3]:
import data
import metrics

In [4]:
%load_ext autoreload
%autoreload 2

In [6]:
def evaluate_model(*, model, metric, X_train, y_train, X_test, y_test):
    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)
    train_error = metric(y_train, train_predictions)
    test_error = metric(y_test, test_predictions)
    return {
        "train_predictions": train_predictions,
        "test_predictions": test_predictions,
        "train_error": train_error,
        "test_error": test_error
    }

def print_report(*, model, evaluation):
    # Use the `model` argument rather than the global `reg` variable.
    print(f"Model used:\n\t{model}")
    print(f"Error:\n\ttrain set {evaluation['train_error']}\n\ttest error: {evaluation['test_error']}")

In [9]:
models_dir = "models"

In [10]:
dataset_path = "dataset.csv"

In [11]:
dataset = data.get_dataset(
    partial(pd.read_csv, filepath_or_buffer=dataset_path),
    splits=("train", "test")
)

**If you need to visualize anything from your training data, do it here**

## Baseline

Before training any complex machine learning model, let's establish a baseline with an educated guess: predict the average price of each house's neighborhood.
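As a rough illustration of what such a baseline could look like, here is a minimal sketch in plain pandas. It is not the project's `AveragePricePerNeighborhoodRegressor`; the function names, the `Neighborhood` column name, and the toy data are assumptions made for this example.

```python
import pandas as pd


def fit_neighborhood_means(X, y, column="Neighborhood"):
    """Return the mean target per neighborhood and the global mean as a fallback."""
    means = y.groupby(X[column]).mean()
    return means, y.mean()


def predict_neighborhood_means(X, means, fallback, column="Neighborhood"):
    """Predict each row's neighborhood mean; unseen neighborhoods get the fallback."""
    return X[column].map(means).fillna(fallback)


# Toy data, made up for illustration only.
X_toy = pd.DataFrame({"Neighborhood": ["A", "A", "B"]})
y_toy = pd.Series([100.0, 200.0, 300.0])
means, fallback = fit_neighborhood_means(X_toy, y_toy)

# "C" never appears in the toy training data, so it falls back to the global mean.
X_new = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
toy_predictions = predict_neighborhood_means(X_new, means, fallback)
```

Falling back to the global mean for unseen neighborhoods is one possible design choice; the actual regressor may handle that case differently.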

In [41]:
model_path = os.path.join("models", "2021-06-04 15-12", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('average-price-per-neighborhood-regressor',
                 AveragePricePerNeighborhoodRegressor())])
Error:
	train set 33609.9990756996
	test error: 34799.09487677742


## Linear Regression Model 

We want to try simple things first, so now let's see how a linear regression model does.

In [40]:
model_path = os.path.join("models", "2021-06-05 15-50", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('categorical-encoder',
                 CategoricalEncoder(additional_pass_through_columns=['HouseAge',
                                                                     'RemodAddAge',
                                                                     'GarageAge'],
                                    force_dense_array=True, one_hot=True)),
                ('standard-scaler', StandardScaler()),
                ('linear-regressor', LinearRegression())])
Error:
	train set 10951.953179992095
	test error: 2967241310539.6943


**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases.
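The two suggestions above could be sketched as follows. This is a hedged example, not part of the project code: the helper names, the use of absolute error, the `Neighborhood` column, and the toy values are all assumptions. It expects `y_true` to be a pandas Series aligned with `X`.

```python
import numpy as np
import pandas as pd


def errors_by_category(X, y_true, y_pred, column):
    """Mean absolute error and row count per category, worst category first."""
    abs_error = (y_true - y_pred).abs()
    summary = abs_error.groupby(X[column]).agg(["mean", "count"])
    return summary.sort_values("mean", ascending=False)


def worst_cases(X, y_true, y_pred, n=10):
    """Return the n rows with the largest absolute error, with their features."""
    out = X.copy()
    out["abs_error"] = (y_true - y_pred).abs()
    return out.sort_values("abs_error", ascending=False).head(n)


# Toy data, made up for illustration only.
X_toy = pd.DataFrame({"Neighborhood": ["A", "A", "B"]})
y_toy = pd.Series([100.0, 200.0, 300.0])
predictions_toy = np.array([110.0, 180.0, 250.0])

per_category = errors_by_category(X_toy, y_toy, predictions_toy, "Neighborhood")
worst = worst_cases(X_toy, y_toy, predictions_toy, n=1)
```

With the notebook's variables, the same helpers could be called with `dataset["test"][0]`, `dataset["test"][1]`, and `evaluation["test_predictions"]`, assuming the features are a DataFrame and the target is a Series.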

## Linear regression with Feature Engineering

The previous model is probably not good enough, so let's see how a model performs when it uses some engineered features.

Techniques:
1. Feature Cross
2. Discretizer
3. Add average per neighborhood.
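
To make the three techniques concrete, here is a minimal pandas sketch of each one. It does not reproduce the project's `Discretizer` or `AveragePricePerNeighborhoodExtractor`; the helper names, the `Street` column, the output column names, and the toy data are assumptions for illustration.

```python
import pandas as pd


def add_feature_cross(df, col_a, col_b):
    """Feature cross: combine two categorical columns into a single one."""
    out = df.copy()
    out[f"{col_a}_x_{col_b}"] = out[col_a].astype(str) + "_" + out[col_b].astype(str)
    return out


def discretize_quantiles(df, column, bins):
    """Discretizer: replace a numeric column by its quantile bin index."""
    out = df.copy()
    out[column] = pd.qcut(out[column], q=bins, labels=False, duplicates="drop")
    return out


def add_neighborhood_average(df, prices, column="Neighborhood"):
    """Add the average price of each row's neighborhood as a new feature."""
    out = df.copy()
    out["NeighborhoodAvgPrice"] = prices.groupby(df[column]).transform("mean")
    return out


# Toy data, made up for illustration only.
df_toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "Street": ["P", "G", "P", "P"],
    "LotArea": [1000, 2000, 3000, 4000],
})
prices_toy = pd.Series([100.0, 200.0, 300.0, 500.0])

crossed = add_feature_cross(df_toy, "Neighborhood", "Street")
binned = discretize_quantiles(df_toy, "LotArea", bins=2)
averaged = add_neighborhood_average(df_toy, prices_toy)
```

In a real pipeline the neighborhood averages and the quantile edges should be computed on the training split only and then reused on the test split, otherwise the test prices leak into the features.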


In [39]:
model_path = os.path.join("models", "2021-06-05 15-48", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('averager', AveragePricePerNeighborhoodExtractor()),
                ('discretizer',
                 Discretizer(bins_per_column={'LotArea': 3, 'LotFrontage': 5},
                             strategy='quantile')),
                ('categorical-encoder',
                 CategoricalEncoder(additional_categories={'LotArea': [0.0, 1.0,
                                                                       2.0],
                                                           'LotFrontage': [0.0,
                                                                           1.0,
                                                                           2.0,
                                                                           3.0,
                                                                           4.0]},
                                    additional_pass_through_columns=['HouseAge',
                      

**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases. 

## Regularized Linear Regression

Let's assume the model is overfitting. Load the results of a linear regression model trained with a regularized loss.
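
The notebook does not say which regularizer the saved model uses. As one possible illustration, the sketch below assumes an L2 penalty (scikit-learn's `Ridge`) and compares its coefficients with those of an unregularized `LinearRegression` on synthetic data. The helper name, the `alpha` value, and the data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_linear_pipeline(regressor):
    """Scale the features, then fit the given linear regressor."""
    return Pipeline(steps=[
        ("standard-scaler", StandardScaler()),
        ("linear-regressor", regressor),
    ])


# Synthetic data: few rows, many features, only the first feature matters.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 20))
y_toy = X_toy[:, 0] + rng.normal(scale=0.5, size=30)

plain = make_linear_pipeline(LinearRegression()).fit(X_toy, y_toy)
regularized = make_linear_pipeline(Ridge(alpha=10.0)).fit(X_toy, y_toy)

# The L2 penalty shrinks the coefficient vector towards zero.
plain_norm = np.linalg.norm(plain.named_steps["linear-regressor"].coef_)
ridge_norm = np.linalg.norm(regularized.named_steps["linear-regressor"].coef_)
print(f"coefficient norm: plain={plain_norm:.3f}, ridge={ridge_norm:.3f}")
```

Smaller coefficients make the model less sensitive to individual features, which is the usual motivation for regularizing an overfit linear model.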

In [36]:
model_path = os.path.join("models", "2021-06-05 15-43", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('discretizer',
                 Discretizer(bins_per_column={'LotArea': 3, 'LotFrontage': 5},
                             strategy='quantile')),
                ('categorical-encoder',
                 CategoricalEncoder(additional_categories={'LotArea': [0.0, 1.0,
                                                                       2.0],
                                                           'LotFrontage': [0.0,
                                                                           1.0,
                                                                           2.0,
                                                                           3.0,
                                                                           4.0]},
                                    additional_pass_through_columns=['HouseAge',
                                                                     'RemodAddAge',
        

**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases. 

## Decision Tree

Decision trees offer great capacity: they can fit even a noisy dataset almost perfectly. Let's see how one behaves on the task at hand.

**Overfitting case**
Let's look at the results for a model that has greatly overfit the data. This would not be an ideal model, but it at least tells us that the model is powerful enough for the task at hand.
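
To see why an unconstrained tree can reach zero training error, here is a small sketch on synthetic noisy data. The data, the use of mean absolute error, and the `max_depth=3` comparison are assumptions for illustration, not the project's metric or settings.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data: a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=200)

# With no depth limit the tree keeps splitting until every leaf is pure,
# so it memorizes the training targets, noise included.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Limiting the depth forces the tree to average over many samples per leaf.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train_error = mean_absolute_error(y_train, deep_tree.predict(X_train))
shallow_train_error = mean_absolute_error(y_train, shallow_tree.predict(X_train))
print(f"train error: deep={deep_train_error}, shallow={shallow_train_error}")
```

A training error of zero says nothing about generalization, which is why the test error has to be read alongside it.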

In [42]:
model_path = os.path.join("models", "2021-06-05 15-51", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('categorical-encoder',
                 CategoricalEncoder(additional_pass_through_columns=['HouseAge',
                                                                     'RemodAddAge',
                                                                     'GarageAge'],
                                    force_dense_array=True)),
                ('decision-tree-regressor', DecisionTreeRegressor())])
Error:
	train set 0.0
	test error: 27141.272597526167


**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases. 

**Using the best hyperparameters** Now let's see how much a simple decision tree can give us.

In [43]:
model_path = os.path.join("models", "2021-06-05 15-54", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('categorical-encoder',
                 CategoricalEncoder(additional_pass_through_columns=['HouseAge',
                                                                     'RemodAddAge',
                                                                     'GarageAge'],
                                    force_dense_array=True)),
                ('decision-tree-regressor',
                 DecisionTreeRegressor(max_depth=8, max_features='auto'))])
Error:
	train set 8884.676197932415
	test error: 28098.13544441429


## Random Forest

Now it is time to use a model that can regularize the previous one: a random forest averages many decision trees, which reduces the variance of a single tree.
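
The averaging behind a random forest can be checked directly in scikit-learn. The sketch below uses synthetic data and much smaller hyperparameters than the saved model, both chosen only for illustration, and verifies that the forest's prediction is the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=1.0, size=200)

forest = RandomForestRegressor(n_estimators=16, max_depth=6, random_state=0)
forest.fit(X_toy, y_toy)

# Each fitted tree is available in `estimators_`; the forest averages them.
forest_predictions = forest.predict(X_toy)
mean_of_trees = np.mean([tree.predict(X_toy) for tree in forest.estimators_], axis=0)
print(np.allclose(forest_predictions, mean_of_trees))
```

Because each tree is trained on a different bootstrap sample and considers a random subset of features at each split, their individual errors partly cancel out when averaged.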

In [44]:
model_path = os.path.join("models", "2021-06-05 16-08", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('categorical-encoder',
                 CategoricalEncoder(additional_pass_through_columns=['HouseAge',
                                                                     'RemodAddAge',
                                                                     'GarageAge'],
                                    force_dense_array=True)),
                ('random-forest-regressor',
                 RandomForestRegressor(max_depth=18, max_features=20,
                                       n_estimators=512))])
Error:
	train set 5741.839429708461
	test error: 17727.675141267133


**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases. 