# Report

This notebook reports progress on the house price prediction project.

In [19]:
import os
from functools import partial

In [20]:
import joblib
import pandas as pd

In [21]:
import data
import metrics

In [22]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [28]:
def evaluate_model(*, model, metric, X_train, y_train, X_test, y_test):
    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)
    train_error = metric(y_train, train_predictions)
    test_error = metric(y_test, test_predictions)
    return {
        "train_predictions": train_predictions,
        "test_predictions": test_predictions,
        "train_error": train_error,
        "test_error": test_error
    }

def print_report(*, model, evaluation):
    print(f"Model used:\n\t{model}")
    print(f"Error:\n\ttrain set {evaluation['train_error']}\n\ttest error: {evaluation['test_error']}")

In [29]:
models_dir = "models"

In [30]:
dataset_path = "dataset.csv"

In [31]:
dataset = data.get_dataset(
    partial(pd.read_csv, filepath_or_buffer=dataset_path),
    splits=("train", "test")
)

**If you need to visualize anything from your training data, do it here**

## Baseline

Before building any complex machine learning model, let's try to solve the problem with an initial educated guess.
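
The baseline loaded in the next cell is reported as `AveragePricePerNeighborhoodRegressor`, whose implementation is not shown in this notebook. As a rough illustration of the idea, here is a minimal sketch of a regressor that predicts the mean training price of each neighborhood. The class name `NeighborhoodMeanBaseline`, the `Neighborhood` column name, and the fallback to the global mean are assumptions made for this example, not the project's actual code.

```python
import pandas as pd


class NeighborhoodMeanBaseline:
    """Hypothetical baseline: predict the mean training price per neighborhood."""

    def __init__(self, column="Neighborhood"):
        self.column = column  # assumed name of the neighborhood column

    def fit(self, X, y):
        prices = pd.Series(list(y), index=X.index)
        # Mean price per neighborhood, plus a global mean used as a fallback.
        self.means_ = prices.groupby(X[self.column]).mean()
        self.global_mean_ = prices.mean()
        return self

    def predict(self, X):
        # Neighborhoods unseen during fit fall back to the global mean.
        return X[self.column].map(self.means_).fillna(self.global_mean_).to_numpy()


# Toy data, for illustration only.
X_train = pd.DataFrame({"Neighborhood": ["A", "A", "B"]})
y_train = [100_000, 120_000, 200_000]
X_test = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})

baseline = NeighborhoodMeanBaseline().fit(X_train, y_train)
predictions = baseline.predict(X_test)
```

Any later model should beat an educated guess of this kind on the test set to justify its extra complexity.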

In [32]:
model_path = os.path.join("models", "2021-06-04 15-12", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('average-price-per-neighborhood-regressor',
                 AveragePricePerNeighborhoodRegressor())])
Error:
	train set 33609.9990756996
	test error: 34799.09487677742


## Linear Regression Model 

We want to try simple things first, so now let's see how a linear regression model does.

In [44]:
model_path = os.path.join("models", "2021-06-05 02-46", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('categorical-encoder',
                 CategoricalEncoder(additional_pass_through_columns=['HouseAge',
                                                                     'RemodAddAge',
                                                                     'GarageAge'],
                                    force_dense_array=True, one_hot=True)),
                ('standard-scaler', StandardScaler()),
                ('linear-regressor', LinearRegression())])
Error:
	train set 10918.49482187895
	test error: 10603307471585.064


**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases.
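
The two suggestions above could be sketched with pandas roughly as follows. The data is a toy stand-in for the test features, targets, and predictions, and the column names `Neighborhood` and `HouseAge` are assumed to exist in the real feature frame.

```python
import pandas as pd

# Toy stand-ins for the notebook's test features, targets, and predictions.
features = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "HouseAge": [5, 40, 12, 60, 3, 25],
})
y_true = pd.Series([200_000, 150_000, 300_000, 180_000, 250_000, 210_000])
y_pred = pd.Series([195_000, 170_000, 310_000, 120_000, 251_000, 204_000])

errors = features.assign(abs_error=(y_true - y_pred).abs())

# 1. Group the errors by a categorical variable.
error_by_neighborhood = (
    errors.groupby("Neighborhood")["abs_error"].mean().sort_values(ascending=False)
)

# 2. Bucket the errors into quantile bins and compare a feature across buckets.
errors["error_bucket"] = pd.qcut(
    errors["abs_error"], q=3, labels=["low", "mid", "high"]
)
age_by_bucket = errors.groupby("error_bucket", observed=True)["HouseAge"].mean()

# 3. Sort to inspect the worst individual predictions.
worst = errors.sort_values("abs_error", ascending=False).head(2)
```

In this toy data the high-error bucket contains the oldest houses, which is the kind of pattern the analysis is meant to surface.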

## Linear regression with Feature Engineering

The previous model is probably not good enough, so let's see how a model performs with some engineered features.

Techniques:
1. Feature Cross
2. Discretizer
3. Add average per neighborhood.
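
As a rough illustration of the three techniques, here is a pandas sketch on a toy frame. The project's own transformers are not shown in this notebook, so the column names (`HouseStyle`, `SalePrice`), the bin edges, and the output column names are assumptions made for the example.

```python
import pandas as pd

# Toy frame; all column names and values here are illustrative assumptions.
houses = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "HouseStyle": ["1Story", "2Story", "1Story", "1Story"],
    "HouseAge": [3, 45, 18, 70],
    "SalePrice": [200_000, 160_000, 300_000, 220_000],
})

# 1. Feature cross: combine two categorical columns into one joint category.
houses["Neighborhood_x_HouseStyle"] = (
    houses["Neighborhood"] + "_" + houses["HouseStyle"]
)

# 2. Discretizer: bucket a numeric column into fixed, ordered ranges.
houses["HouseAgeBucket"] = pd.cut(
    houses["HouseAge"], bins=[0, 10, 30, 100], labels=["new", "mid", "old"]
)

# 3. Average per neighborhood: mean price of each neighborhood, mapped back
#    onto every row. In a real pipeline the means should come from training
#    rows only, so that test prices do not leak into the feature.
neighborhood_mean = houses.groupby("Neighborhood")["SalePrice"].mean()
houses["AveragePriceInNeighborhood"] = houses["Neighborhood"].map(neighborhood_mean)
```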


In [46]:
model_path = os.path.join("models", "2021-06-05 02-49", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('averager', AveragePricePerNeighborhoodExtractor()),
                ('categorical-encoder',
                 CategoricalEncoder(additional_pass_through_columns=['HouseAge',
                                                                     'RemodAddAge',
                                                                     'GarageAge',
                                                                     'AveragePriceInNeihborhood'],
                                    force_dense_array=True, one_hot=True)),
                ('standard-scaler', StandardScaler()),
                ('linear-regressor', LinearRegression())])
Error:
	train set 10956.657887374604
	test error: 17232262514823.115


**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases. 

## Regularized Linear Regression

Let's assume the model is overfitting. Load the results of a linear regression model trained with a regularized loss.
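
A regularized loss adds a penalty on the size of the weights to the squared error. The NumPy sketch below shows the closed-form ridge solution on toy data with two identical feature columns, a situation in which the unregularized normal equations are singular. It is a simplified illustration (no intercept, hypothetical helper `ridge_fit`), not the scikit-learn `Ridge` implementation used by the loaded model.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge weights: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(0)
x = rng.normal(size=50)

# Two identical columns: X^T X has rank 1, so plain least squares has no
# unique solution.
X = np.column_stack([x, x])
y = 3.0 * x + rng.normal(scale=0.1, size=50)

gram_rank = np.linalg.matrix_rank(X.T @ X)

# The penalty makes the system invertible and splits the weight evenly
# between the duplicated columns.
w_ridge = ridge_fit(X, y, alpha=1.0)
```

Larger values of `alpha` shrink the weights further, trading a slightly higher training error for more stable predictions on unseen data.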

In [40]:
model_path = os.path.join("models", "2021-06-05 02-43", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

Model used:
	Pipeline(steps=[('age-extractor', AgeExtractor()),
                ('categorical-encoder',
                 CategoricalEncoder(additional_pass_through_columns=['HouseAge',
                                                                     'RemodAddAge',
                                                                     'GarageAge'],
                                    force_dense_array=True, one_hot=True)),
                ('standard-scaler', StandardScaler()),
                ('ridge-regressor', Ridge(alpha=100))])
Error:
	train set 11278.049274904646
	test error: 17979.97582848677


**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases. 

## Decision Tree

Decision trees offer great complexity: they can fit even a noisy dataset almost perfectly. Let's see how one behaves on the task at hand.

**Overfitting case**
Let's look at the results for a model that has greatly overfit the data. This would not be an ideal model, but it can at least tell us whether a decision tree is powerful enough for the task at hand.
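
To make the overfitting behaviour concrete, here is a small sketch on synthetic data (not the house price dataset), assuming scikit-learn's `DecisionTreeRegressor` with no depth limit. With distinct inputs, the tree can isolate every training point in its own leaf and reproduce the noisy targets.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D problem: a smooth signal plus noise.
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

# Hold out every fourth point as a test set.
test_mask = np.arange(200) % 4 == 0
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# No depth limit: the tree keeps splitting until each leaf holds one sample.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = np.mean((tree.predict(X_train) - y_train) ** 2)
test_mse = np.mean((tree.predict(X_test) - y_test) ** 2)
```

The training error here is zero because the tree memorizes the noise, while the test error stays well above it. A gap of that shape between train and test error is the signature to look for in the report below.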

In [None]:
model_path = os.path.join("models", "", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases. 

**Using best hyperparameters** Now let's see how much a simple decision tree can give us.

In [None]:
model_path = os.path.join("models", "", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

## Random Forest

Now it is time to use a model that can regularize the previous one: a random forest averages many decision trees, which reduces the variance of any single tree.

In [None]:
model_path = os.path.join("models", "", "model.joblib")
reg = joblib.load(model_path)
evaluation = evaluate_model(
    model=reg,
    metric=metrics.custom_error,
    X_train=dataset["train"][0],
    y_train=dataset["train"][1],
    X_test=dataset["test"][0],
    y_test=dataset["test"][1]
)
print_report(model=reg, evaluation=evaluation)

**Error Analysis**

What can you learn about the errors your model is making? Try this:

* Discretize the errors your model is making by some categorical variables.
* Sort or discretize the errors your model is making and see what the features have in common in those cases. 