# Modeling and Evaluation

## 1. Objectives

- Train two regression models
- Evaluate the performance of trained models
- Improve their performance if possible

## 2. Inputs

- /workspace/house-price-regression/outputs/datasets/cleaned/TrainSetCleaned.csv
- /workspace/house-price-regression/outputs/datasets/cleaned/TestSetCleaned.csv

## 3. Outputs

- /workspace/house-price-regression/outputs/models/regression/random_forest_regressor.pkl
- /workspace/house-price-regression/outputs/models/regression/linear_regression.pkl

## 4. Imports

In [1]:
import pandas as pd
import numpy as np

## 5. Load Data

We will load the cleaned data with selected features and without missing values.

In [2]:
import os

# Get the current working directory (cwd)
cwd = os.getcwd()
print(f"[*] Previous working directory: {cwd}")

# Make the parent of the cwd the new cwd
os.chdir(os.path.dirname(cwd))
cwd = os.getcwd()
print(f"[*] Updated current working directory: {cwd}")

# Load the data
df_train = pd.read_csv("outputs/datasets/cleaned/TrainSetCleaned.csv")
df_test = pd.read_csv("outputs/datasets/cleaned/TestSetCleaned.csv")

[*] Previous working directory: /workspace/house-price-regression/jupyter_notebooks
[*] Updated current working directory: /workspace/house-price-regression


## 6. Modeling

- Create Random Forest Regressor model
- Create Linear Regression model

### 6.1 Create Training and Validation Data

To get a reliable estimate of how the models perform on unseen data, we'll split the training data into smaller data sets. The largest will be the new training data set, while the smaller ones will be held back for evaluation.

The CSV file provided for testing purposes contains no labels, so it cannot be used for this purpose. We'll focus on the training data set instead and split that into **training**, **validation** and **test** sets.

In [3]:
from sklearn.model_selection import train_test_split

X = df_train.drop("SalePrice", axis=1)
y = df_train["SalePrice"]

X_train, X_val_and_test, y_train, y_val_and_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

X_val, X_test, y_val, y_test = train_test_split(
    X_val_and_test, y_val_and_test, test_size=0.5, random_state=42)

In [4]:
# Verify lengths of the created training, validation and test sets
print(f"X_train length: {len(X_train)}\ny_train length: {len(y_train)}\n")
print(f"X_val length: {len(X_val)}\ny_val length: {len(y_val)}\n")
print(f"X_test length: {len(X_test)}\ny_test length: {len(y_test)}")

X_train length: 1022
y_train length: 1022

X_val length: 219
y_val length: 219

X_test length: 219
y_test length: 219
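
The two-stage split above could also be wrapped in a small helper. This is only a sketch: `split_train_val_test` and its parameter names are hypothetical and not part of the notebook, and the placeholder arrays simply stand in for the 1,460 labelled rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_train_val_test(X, y, holdout_size=0.3, random_state=42):
    """Split into train / validation / test sets (hypothetical helper).

    The hold-out share is split in half, so the default of 0.3
    gives roughly 70% / 15% / 15%.
    """
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout_size, random_state=random_state)
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.5, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Placeholder data with as many rows as the three sets above combined
X_demo = np.arange(1460).reshape(-1, 1)
y_demo = np.arange(1460)

parts = split_train_val_test(X_demo, y_demo)
sizes = [len(part) for part in parts[:3]]
print(sizes)
```

Fixing `random_state` in both calls keeps the split reproducible between runs.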


In [5]:
y_test

1418    124000
44      141000
588     143000
1048    115000
439     110000
         ...  
1094    129000
374     219500
411     145000
1134    169000
292     131000
Name: SalePrice, Length: 219, dtype: int64

### 6.2 Creating an evaluation function

The metrics for evaluating regression models are:

- Mean Absolute Error (MAE)
- Root Mean Squared Logarithmic Error (RMSLE)
- R squared (R^2)

We'll create a function to evaluate our two models.

In [27]:
from sklearn.metrics import mean_squared_log_error, mean_absolute_error

# Helper that returns the RMSLE as the square root of sklearn's mean_squared_log_error
def rmsle(y_test, y_predictions):
    return np.sqrt(mean_squared_log_error(y_test, y_predictions))

# Function for model evaluation
def show_scores(model):
    train_predictions = model.predict(X_train)
    val_predictions = model.predict(X_val)
    scores = {"Training MAE": mean_absolute_error(y_train, train_predictions),
              "Validation MAE": mean_absolute_error(y_val, val_predictions),
              "Training RMSLE": rmsle(y_train, train_predictions),
              "Validation RMSLE": rmsle(y_val, val_predictions),
              "Training R^2": model.score(X_train, y_train),
              "Validation R^2": model.score(X_val, y_val)}
    return scores
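
To make the three metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The prices are made up for illustration and are not taken from the dataset; the formulas mirror what `mean_absolute_error`, `mean_squared_log_error` and the regressor's `score` method compute.

```python
import numpy as np

# Hypothetical sale prices, not taken from the dataset
y_true = np.array([100000.0, 150000.0, 200000.0])
y_pred = np.array([110000.0, 140000.0, 200000.0])

# MAE: average absolute difference, in the same units as the target
mae = np.mean(np.abs(y_true - y_pred))

# RMSLE: root of the mean squared difference of log(1 + value),
# so it reflects relative rather than absolute errors
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# R^2: 1 minus the residual sum of squares over the total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}, RMSLE: {rmsle:.4f}, R^2: {r2:.2f}")
```

Lower is better for MAE and RMSLE, while R^2 is better the closer it is to 1.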

### 6.3 Random Forest Regressor Model

In [7]:
from sklearn.ensemble import RandomForestRegressor

# Create instance of RandomForestRegressor model
rf_model = RandomForestRegressor(n_jobs=-1)

In [8]:
# Fit the model
rf_model.fit(X_train, y_train)

RandomForestRegressor(n_jobs=-1)

In [9]:
# Show scores
show_scores(rf_model)

{'Training MAE': 7800.457051579535,
 'Validation MAE': 18185.88323331159,
 'Training RMSLE': 0.0670654624158919,
 'Validation RMSLE': 0.16430383379024108,
 'Training R^2': 0.9742834327217318,
 'Validation R^2': 0.8857920515893536}

#### 6.3.1 Hyperparameter tuning with RandomizedSearchCV

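Use randomised search with cross-validation to look for better hyperparameters. With `n_iter=100` and `cv=5`, 100 parameter combinations are sampled from the grid and each one is scored with 5-fold cross-validation.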

In [10]:
from sklearn.model_selection import RandomizedSearchCV

# RandomForestRegressor hyperparameters for testing
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           # Note: "auto" was removed in scikit-learn 1.3; 1.0 is the equivalent for regressors
           "max_features": [0.5, 1, "sqrt", "auto"],
           "max_samples": [800]}

rscv_model = RandomizedSearchCV(RandomForestRegressor(),
                              param_distributions=rf_grid,
                              n_iter=100,
                              cv=5,
                              verbose=True)

rscv_model.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=100,
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [800],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
                   verbose=True)

In [11]:
# Show the best parameters for the model
rscv_model.best_params_

{'n_estimators': 40,
 'min_samples_split': 4,
 'min_samples_leaf': 7,
 'max_samples': 800,
 'max_features': 0.5,
 'max_depth': 10}

In [12]:
# Evaluate the optimised model
show_scores(rscv_model)

{'Training MAE': 17263.04257894391,
 'Validation MAE': 18746.42976399382,
 'Training RMSLE': 0.14337602722783224,
 'Validation RMSLE': 0.16990372323721753,
 'Training R^2': 0.8607503083609582,
 'Validation R^2': 0.8868594247224642}

Next, we train a new model with the best parameters found by the search.

In [13]:
rf_model_ideal = RandomForestRegressor(n_jobs=-1,
                                       n_estimators=40,
                                       min_samples_split=4,
                                       min_samples_leaf=7,
                                       max_samples=800,
                                       max_features=0.5,
                                       max_depth=10)

In [14]:
# Fit optimised model
rf_model_ideal.fit(X_train, y_train)

RandomForestRegressor(max_depth=10, max_features=0.5, max_samples=800,
                      min_samples_leaf=7, min_samples_split=4, n_estimators=40,
                      n_jobs=-1)
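
Instead of retyping the best parameters by hand, they can be taken straight from the search object. The sketch below assumes a small synthetic dataset and a deliberately tiny grid (`X_demo`, `y_demo` and `demo_grid` are illustrative names, not part of the notebook) so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

# Deliberately small grid so the sketch runs quickly
demo_grid = {"n_estimators": [10, 20],
             "max_depth": [None, 3, 5],
             "min_samples_leaf": [1, 3]}

search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                            param_distributions=demo_grid,
                            n_iter=4, cv=3, random_state=42)
search.fit(X_demo, y_demo)

# Option 1: reuse the estimator that the search refitted on all
# of the data passed to fit() (refit=True is the default)
best_model = search.best_estimator_

# Option 2: build a fresh model by unpacking the parameter dictionary
rebuilt_model = RandomForestRegressor(n_jobs=-1, **search.best_params_)
rebuilt_model.fit(X_demo, y_demo)
```

Either option avoids copy errors when the search is rerun and the best parameters change.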

In [15]:
# Predict values for the held-out test set
rf_predictions = rf_model_ideal.predict(X_test)

# Compare predictions to actual values
rf_predictions_comp = pd.DataFrame({"Predicted": rf_predictions,
                                    "Actual": y_test.to_numpy()})

rf_predictions_comp.head()

|   | Predicted | Actual |
|---|---|---|
| 0 | 142225.397126 | 124000 |
| 1 | 138423.643063 | 141000 |
| 2 | 165914.185971 | 143000 |
| 3 | 171922.171642 | 115000 |
| 4 | 130394.00497 | 110000 |
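
The comparison is easier to read with explicit error columns. The following sketch rebuilds the five rows shown above and adds them; the column names `Error` and `Error %` are illustrative choices, not part of the notebook.

```python
import pandas as pd

# First five rows of the comparison shown above
comparison = pd.DataFrame({
    "Predicted": [142225.397126, 138423.643063, 165914.185971,
                  171922.171642, 130394.00497],
    "Actual": [124000, 141000, 143000, 115000, 110000],
})

# Signed error: positive values mean the model over-predicted
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]

# Error relative to the actual price, in percent
comparison["Error %"] = 100 * comparison["Error"] / comparison["Actual"]

print(comparison.round(2))
```

In the notebook itself, the same two lines could be applied to `rf_predictions_comp` to inspect all 219 test rows.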


## 7. Export the trained model as a pkl file

In [37]:
# Create the outputs/models folder if it does not already exist
import os
os.makedirs('outputs/models', exist_ok=True)

In [38]:
import pickle
# Use a context manager so the file is closed once the model is written
with open('outputs/models/rf_model.pkl', 'wb') as f:
    pickle.dump(rf_model_ideal, f)
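
A quick way to check an export is to load the file back and confirm that the predictions match. This is a minimal sketch under stated assumptions: it uses a small synthetic model and a temporary directory rather than `rf_model_ideal` and the project folder, and `demo_model.pkl` is a hypothetical file name.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for rf_model_ideal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back from disk
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), loaded_model.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.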