In [1]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.losses import Huber
from tensorflow.keras import regularizers
from time import time
from sklearn.metrics import mean_absolute_error
import wandb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In my previous post, I went through the following process:

Environment: Anaconda, Windows 11.

- Data wrangling (Light exploration, followed by removing and transforming some variables)
- Split the training dataset into training and validation datasets
- Fit a model using a Scikit-Learn pipeline (Data Preprocessing + fitting XGBoost/LightGBM estimators with a Randomized Search across their respective hyperparameters)
- Evaluate and visualize model performance
- Implement an automated approach to selecting hyperparameters (HyperOpt)
- Make predictions

In this post, I will implement the following using the same wrangled/preprocessed data:

Environment: Docker, Windows Subsystem for Linux 2 (WSL 2), Windows 11.

- Build a simple Sequential model in Keras/TensorFlow.
- Use the Weights and Biases (WandB) platform to select optimal hyperparameters and record experiments.
  - Experiments are evaluated using K-Fold Cross Validation. The mean RMSE across folds for each experiment is logged to WandB as a custom metric.
- Make predictions.
- Blend predictions from the previous post (decision tree) and this post (neural net).
  - I do this in two ways: by taking the mean of the predictions, and by training a meta-model on a holdout dataset kept completely separate from the training data.
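
The evaluation scheme in the list above (mean RMSE across K folds) can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the helper name `kfold_mean_rmse` is hypothetical, and a NumPy least-squares fit stands in for the Keras model so that the example runs without a GPU.

```python
import numpy as np

def kfold_mean_rmse(X, y, n_splits=5):
    """Return the per-fold and mean validation RMSE of a least-squares fit."""
    # Contiguous, unshuffled folds over the row indices
    folds = np.array_split(np.arange(len(X)), n_splits)
    rmse_per_fold = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit on the training folds only (stand-in for the Keras model)
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        # Score on the fold that was held out
        residuals = X[val_idx] @ coef - y[val_idx]
        rmse_per_fold.append(float(np.sqrt(np.mean(residuals ** 2))))
    return rmse_per_fold, float(np.mean(rmse_per_fold))

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_scores, mean_rmse = kfold_mean_rmse(X_demo, y_demo)
print(fold_scores, mean_rmse)
```

The key point is that each model only ever sees its training folds, and the number reported for an experiment is the average of the held-out scores.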

Setting up the environment (Docker, Windows Subsystem for Linux 2 (WSL 2), Windows 11, VSCode, CUDA) took a couple of days and probably deserves its own blog post. It involved the following steps:

- Install WSL2, CUDA drivers and Docker
    - Define a Dockerfile that uses a base image compatible with CUDA
    - Get libraries from requirements.txt
    - Set "runArgs" in devcontainer.json to allow GPU usage
- Run in VSCode (the Jupyter extension gave me some trouble)
    - I ended up creating the container directly from a Dockerfile in the same repo as my code

In [2]:
# Function to bring in wrangled/preprocessed data from previous post
def data():
    training = pd.read_csv("../sklearn/training_preprocessed")
    validation = pd.read_csv("../sklearn/validation_preprocessed")
    holdout = pd.read_csv("../sklearn/holdout_preprocessed")
    holdout_predictions_df = pd.read_csv("../sklearn/holdout_preds_preprocessed")
    test = pd.read_csv("../sklearn/test_preprocessed")
    
    X_train = training.drop(columns="SalePrice")
    y_train = training["SalePrice"]
    X_valid = validation.drop(columns="SalePrice")
    y_valid = validation["SalePrice"]
    X_holdout = holdout.drop(columns="Actual_SalePrice")
    y_holdout = holdout["Actual_SalePrice"]
    X_test = test
    return X_train, y_train, X_valid, y_valid, X_holdout, y_holdout, X_test, holdout_predictions_df

# Bring in data
X_train, y_train, X_valid, y_valid, X_holdout, y_holdout, X_test, holdout_predictions_df = data()

# K-fold cross-validation creates its own validation splits, so merge the training and validation sets
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
X_train = pd.concat([X_train, X_valid], ignore_index=True)
y_train = pd.concat([y_train, y_valid], ignore_index=True).to_numpy()



In [4]:
# Log into Weights and Biases
wandb.init(project="house-price-prediction", entity="luiscostigan")

In [None]:
# Use WandB offline (to prevent syncing after each run)
%env WANDB_MODE=offline

In [5]:
# Allow TF to release memory
%env TF_GPU_ALLOCATOR=cuda_malloc_async

env: TF_GPU_ALLOCATOR=cuda_malloc_async


In [14]:
# Sync offline runs
! wandb sync --sync-all --no-include-online

Syncing: https://wandb.ai/luiscostigan/house-price-prediction/runs/jvzgmc34 ...done.
Syncing: https://wandb.ai/luiscostigan/house-price-prediction/runs/o52yqket ...done.
Syncing: https://wandb.ai/luiscostigan/house-price-prediction/runs/4u0kge6j ...done.
Syncing: https://wandb.ai/luiscostigan/house-price-prediction/run

In [6]:
# Define simple Sequential model
def create_model():
    model = Sequential()
    model.add(Dense(wandb.config.dense1, input_dim=X_train.shape[1], activation='relu', kernel_regularizer=regularizers.l2(wandb.config.bias1)))
    model.add(Dropout(wandb.config.dropout1))
    model.add(Dense(wandb.config.dense2, activation='relu', kernel_regularizer=regularizers.l2(wandb.config.bias2)))
    model.add(Dense(1))
    model.compile(optimizer=Adam(wandb.config.learning_rate), loss="mean_absolute_error", metrics=[RootMeanSquaredError()])
    
    return model

In [11]:
# Define training function and hyperparameter ranges
from wandb.keras import WandbCallback
from sklearn.model_selection import KFold
import multiprocessing

sweep_config = {
  "name": "keras-sequential-model-sweep",
  "method": "bayes",
  "metric": {
    "name": "Mean Validation RMSE (all folds)",
    "goal": "minimize"
  },
  "parameters": {
    "dropout1": {
      "distribution": "uniform",
      "min": 0.0,
      "max": 0.4
    },
    "dense1": {
      "distribution": "categorical",
      "values": [1024, 2048, 4096]
    },
    "dense2": {
      "distribution": "categorical",
      "values": [1024, 2048, 4096]
    },
    "bias1": {
      "distribution": "categorical",
      "values": [0.001, 0.01]
    },
    "bias2": {
      "distribution": "categorical",
      "values": [0.001, 0.01]
    },
    "epochs": {
      "distribution": "categorical",
      "values": [50, 100, 150]
    },
    "batch_size": {
      "distribution": "categorical",
      "values": [8, 16, 32]
    },
    "learning_rate": {
      "distribution": "categorical",
      "values": [0.0001, 0.001, 0.01]
    }
  }
}

config_defaults = {
  "dropout1": 0.1,
  "dense1": 2048,
  "dense2": 2048,
  "bias1": 0.01,
  "bias2": 0.01,
  "epochs": 100,
  "batch_size": 16,
  "learning_rate": 0.001
}

# Define number of splits
kf = KFold(n_splits=5)

def train():

  rmse_per_fold = []
  loss_per_fold = []
  fold_no = 1

  # Convert once so that rows can be selected by fold index
  X = np.asarray(X_train)
  y = np.asarray(y_train)

  # Go through each split, and get the row indices of the training and validation folds
  for train_idx, val_idx in kf.split(X, y):

    # With the current session in WandB
    with wandb.init(config=config_defaults) as run:

      def evaluate_model():

        # Recreate the model for each new fold so that no weights carry over
        model = create_model()

        # Fit on the training folds, validating on the held-out fold
        model.fit(
          X[train_idx],
          y[train_idx],
          epochs=wandb.config.epochs,
          batch_size=wandb.config.batch_size,
          verbose=0,
          callbacks=[WandbCallback()],
          validation_data=(X[val_idx], y[val_idx])
        )

        # Score the model on the held-out fold
        scores = model.evaluate(X[val_idx], y[val_idx], callbacks=[WandbCallback()])
        print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]}')
        rmse_per_fold.append(scores[1])
        loss_per_fold.append(scores[0])

        # == Provide average scores ==
        print('------------------------------------------------------------------------')
        print('Score per fold')
        for i in range(0, len(rmse_per_fold)):
            print('------------------------------------------------------------------------')
            print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - RMSE: {rmse_per_fold[i]}')
            print('------------------------------------------------------------------------')

        wandb.log({
          "Mean Validation RMSE (all folds)": np.mean(rmse_per_fold),
          "Mean Validation Loss (all folds)": np.mean(loss_per_fold)
        })

        wandb.join()

      # TF has a problem releasing cache after an operation, for which the suggested solution is to start and kill a separate process (https://github.com/tensorflow/tensorflow/issues/36465)
      # However, attempting to do this results in an error with WandB (https://github.com/wandb/client/issues/1994)
      # Note: values appended to the lists inside the child process are not visible in the parent process
      p = multiprocessing.Process(target=evaluate_model)
      p.start()
      p.join()

      print('Average scores for all folds:')
      print(f'> RMSE: {np.mean(rmse_per_fold)} (+- {np.std(rmse_per_fold)})')
      print(f'> Loss: {np.mean(loss_per_fold)}')
      print('------------------------------------------------------------------------')

      # Increase fold number
      fold_no = fold_no + 1

keras_sequential_sweep_1 = wandb.sweep(sweep_config, project="house-price-prediction", entity="luiscostigan")

count = 5

wandb.agent(keras_sequential_sweep_1, function=train, count=count)







Create sweep with ID: 7gy04hx1
Sweep URL: https://wandb.ai/luiscostigan/house-price-prediction/sweeps/7gy04hx1


[34m[1mwandb[0m: Agent Starting Run: p0w8f6p1 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	bias1: 0.01
[34m[1mwandb[0m: 	bias2: 0.001
[34m[1mwandb[0m: 	dense1: 2048
[34m[1mwandb[0m: 	dense2: 4096
[34m[1mwandb[0m: 	dropout1: 0.2473418433841911
[34m[1mwandb[0m: 	epochs: 150
[34m[1mwandb[0m: 	learning_rate: 0.0001


Problem at: /tmp/ipykernel_323/4151639895.py 74 train


Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 954, in init
    run = wi.init()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 614, in init
    backend.cleanup()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/backend/backend.py", line 248, in cleanup
    self.interface.join()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/interface/interface_shared.py", line 467, in join
    super().join()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/interface/interface.py", line 630, in join
    _ = self._communicate_shutdown()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/interface/interface_shared.py", line 464, in _communicate_shutdown
    _ = self._communicate(record)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File

2022-03-13 07:57:55.432188: F tensorflow/stream_executor/cuda/cuda_driver.cc:153] Failed setting context: CUDA_ERROR_NOT_INITIALIZED: initialization error
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
  ret = ret.dtype.type(ret / rcount)


Average scores for all folds:
> RMSE: nan (+- nan)
> Loss: nan
------------------------------------------------------------------------


[34m[1mwandb[0m: [32m[41mERROR[0m Abnormal program exit





VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

[34m[1mwandb[0m: Ctrl + C detected. Stopping sweep.
[34m[1mwandb[0m: [32m[41mERROR[0m Problem finishing run
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_run.py", line 1711, in _atexit_cleanup
    self._on_finish()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_run.py", line 1829, in _on_finish
    self._backend.interface.communicate_poll_exit()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/interface/interface.py", line 617, in communicate_poll_exit
    resp = self._communicate_poll_exit(poll_exit)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/interface/interface_shared.py", line 411, in _communicate_poll_exit
    result = self._communicate(rec)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/i

2022-03-13 07:58:03.739777: F tensorflow/stream_executor/cuda/cuda_driver.cc:153] Failed setting context: CUDA_ERROR_NOT_INITIALIZED: initialization error
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
  ret = ret.dtype.type(ret / rcount)


Average scores for all folds:
> RMSE: nan (+- nan)
> Loss: nan
------------------------------------------------------------------------




2022-03-13 07:58:12.065223: F tensorflow/stream_executor/cuda/cuda_driver.cc:153] Failed setting context: CUDA_ERROR_NOT_INITIALIZED: initialization error
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
  ret = ret.dtype.type(ret / rcount)


Average scores for all folds:
> RMSE: nan (+- nan)
> Loss: nan
------------------------------------------------------------------------




2022-03-13 07:58:21.307463: F tensorflow/stream_executor/cuda/cuda_driver.cc:153] Failed setting context: CUDA_ERROR_NOT_INITIALIZED: initialization error
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
  ret = ret.dtype.type(ret / rcount)


Average scores for all folds:
> RMSE: nan (+- nan)
> Loss: nan
------------------------------------------------------------------------




2022-03-13 07:58:29.646099: F tensorflow/stream_executor/cuda/cuda_driver.cc:153] Failed setting context: CUDA_ERROR_NOT_INITIALIZED: initialization error
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
  ret = ret.dtype.type(ret / rcount)


Average scores for all folds:
> RMSE: nan (+- nan)
> Loss: nan
------------------------------------------------------------------------




Hyperparameter optimization and experiment recording all took place within the Weights and Biases platform. The set of hyperparameters resulting in the lowest loss (in terms of RMSE) is noted in the top row of the image below:

<img src="./wandb_rmse.png" width="800">

In [17]:
# Enter best params from sweep
best_params = {
    "dropout": 0.3662,
    "dense1": 4096,
    "dense2": 4096,
    "bias1": 0.001,
    "bias2": 0.01,
    "epochs": 150,
    "batch_size": 16,
    "learning_rate": 0.001
}

# Build model using the best parameters
def make_predictions(best_params, dataset):
    
    model = Sequential()
    model.add(Dense(best_params.get("dense1"), input_dim=X_train.shape[1], activation='relu', kernel_regularizer=regularizers.l2(best_params.get("bias1"))))
    model.add(Dropout(best_params.get("dropout")))
    model.add(Dense(best_params.get("dense2"), activation='relu', kernel_regularizer=regularizers.l2(best_params.get("bias2"))))
    model.add(Dense(1))
    model.compile(optimizer=Adam(best_params.get("learning_rate")), loss="mean_absolute_error", metrics=[RootMeanSquaredError()])
    
    model.fit(X_train, y_train, epochs=best_params.get("epochs"), batch_size=best_params.get("batch_size"), verbose=0)

    preds = model.predict(dataset, batch_size=best_params.get("batch_size"), verbose=0)

    return preds

In [None]:
# Make predictions
test_predictions_log_transformed = make_predictions(best_params, X_test)

In [19]:
# Generate predictions on holdout dataset
holdout_predictions_log_transformed = make_predictions(best_params, X_holdout)

In [22]:
holdout_predictions = np.exp(holdout_predictions_log_transformed)

In [21]:
# Undo the log transform
test_predictions = np.exp(test_predictions_log_transformed)
holdout_predictions = np.exp(holdout_predictions_log_transformed)

# Generating submission CSV
d = {"Id":X_test.index,"SalePrice":test_predictions.flatten()}
submission = pd.DataFrame(data=d, index=None)

# X_test has a 0-based index, while the Kaggle test set Ids start at 1461
submission["Id"] = submission["Id"] + 1461

submission.to_csv("submission_nn.csv",index=False)


## Blending Predictions

So far, I have generated predictions using a decision tree-based model and a neural net-based model.
First, I'll try taking the mean of predictions from both to see how it performs.
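
As a rough illustration of what a simple and a weighted mean blend look like, and how each would be scored, here is a minimal sketch. The numbers, the three "models" and the weights are all made up for the example; the helper `rmse` is hypothetical and not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D sequences."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Made-up log-scale targets and predictions, purely illustrative
y_true = np.array([12.0, 11.5, 12.3, 11.9])
preds = np.array([
    [12.1, 11.4, 12.2, 12.0],   # model A
    [11.9, 11.6, 12.5, 11.8],   # model B
    [12.3, 11.2, 12.4, 12.1],   # model C
])

# Simple blend: every model contributes equally
simple_mean = preds.mean(axis=0)

# Weighted blend: weights sum to 1 and are chosen by hand here
weights = np.array([0.4, 0.4, 0.2])
weighted_mean = weights @ preds

print(rmse(y_true, simple_mean), rmse(y_true, weighted_mean))
```

Because the predictions are on the log scale, averaging them and then applying `np.exp` corresponds to a geometric rather than an arithmetic mean of the predicted prices.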

In [23]:
# Read CSVs
decision_tree_predictions = pd.read_csv("../sklearn/xgb_lgb_test_predictions.csv")
neural_net_predictions = pd.read_csv("../tensorflow/submission_nn.csv")

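# Add the NN predictions as a new column (rows are matched by position, which assumes both files list the same Ids in the same order)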
decision_tree_predictions["NN_Predictions"] = neural_net_predictions["SalePrice"]

# Log transform NN predictions again (for consistent RMSE value)
decision_tree_predictions["NN_Predictions"] = np.log(decision_tree_predictions["NN_Predictions"])

# Add new column with mean
decision_tree_predictions["SalePrice"] = decision_tree_predictions[["NN_Predictions","XGBoost_Predictions","LightGBM_Predictions"]].mean(axis=1)

In [25]:
from sklearn.metrics import mean_squared_error

# Refresh data
X_train, y_train, X_valid, y_valid, X_holdout, y_holdout, X_test, holdout_predictions_df = data()

# Append holdout set NN predictions to DT predictions
holdout_predictions_df["NN_Predictions"] = holdout_predictions_log_transformed
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
holdout_predictions_df = holdout_predictions_df[["XGBoost_Predictions", "LightGBM_Predictions", "NN_Predictions", "Actual_SalePrice"]].copy()

# Add mean column
holdout_predictions_df["Mean_SalePrice"] = holdout_predictions_df[["XGBoost_Predictions", "LightGBM_Predictions", "NN_Predictions"]].mean(axis=1)

# Add weighted mean column (these weights use the NN predictions only)
weight = [0, 0, 1]
holdout_predictions_df["Weighted_Mean_SalePrice"] = holdout_predictions_df[["XGBoost_Predictions", "LightGBM_Predictions", "NN_Predictions"]].dot(weight)

# Calculate RMSE on the log scale (np.sqrt is used because the squared=False argument is not available in recent scikit-learn releases)
print(np.sqrt(mean_squared_error(holdout_predictions_df["Actual_SalePrice"], holdout_predictions_df["Mean_SalePrice"])))
print(np.sqrt(mean_squared_error(holdout_predictions_df["Actual_SalePrice"], holdout_predictions_df["Weighted_Mean_SalePrice"])))


0.14691776450949512
0.2549567297876517


In [None]:
# Check weighted mean on holdout set (back-transformed from the log scale to prices)

np.exp(holdout_predictions_df)

In [None]:
# Drop other columns
decision_tree_predictions = decision_tree_predictions[["Id","SalePrice"]]

# Create mean prediction submission CSV
mean_predictions = decision_tree_predictions
mean_predictions.to_csv("submission_mean.csv",index=False)

Taking the mean of predictions did not beat my score from just using LightGBM in the previous post.

<img src="./mean_predictions_kaggle.png" width="600">

Next, I'll try defining a meta-model to best blend predictions from the two models. After developing each model, predictions were also made on a holdout dataset that was kept separate from the training and validation datasets, for the explicit purpose of training this meta-model. The model trained on this dataset was then used to blend predictions on the test dataset to be submitted to Kaggle.
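
The idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the meta-model used in this post: the data is synthetic, there are only two base models, and an ordinary least-squares fit in NumPy stands in for the meta-model. The names `add_intercept`, `blend` and `rmse` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "holdout" targets and two base-model predictions (log scale)
y_holdout_demo = rng.normal(loc=12.0, scale=0.4, size=200)
base_a = y_holdout_demo + rng.normal(scale=0.10, size=200)
base_b = y_holdout_demo + rng.normal(scale=0.20, size=200)

def add_intercept(columns):
    """Stack 1-D prediction arrays as columns and prepend a column of ones."""
    stacked = np.column_stack(columns)
    return np.column_stack([np.ones(len(stacked)), stacked])

# Fit the meta-model (ordinary least squares) on the holdout predictions only
meta_X = add_intercept([base_a, base_b])
meta_coef, *_ = np.linalg.lstsq(meta_X, y_holdout_demo, rcond=None)

def blend(pred_a, pred_b):
    """Apply the fitted meta-model to a new pair of base-model predictions."""
    return add_intercept([pred_a, pred_b]) @ meta_coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

blended = blend(base_a, base_b)
print(rmse(y_holdout_demo, base_a), rmse(y_holdout_demo, base_b), rmse(y_holdout_demo, blended))
```

On the data it was fitted to, a least-squares blend can never score worse than either input, so an improvement on the holdout set says little about the test set. The features passed at prediction time also need to match those used for fitting in number, order and scale (log versus raw prices).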

In [None]:
# Get predictions on test set
test_predictions_dt = pd.read_csv("/root/data-science-projects-1/house-price-prediction/Models/sklearn/submission_dt.csv")
test_predictions_nn = pd.read_csv("/root/data-science-projects-1/house-price-prediction/Models/tensorflow/submission_nn.csv")

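# Add the NN predictions as a new column (rows are matched by position, which assumes both files list the same Ids in the same order)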
test_predictions_dt["NN_predictions"] = test_predictions_nn["SalePrice"]

# Rename decision tree predictions column
test_predictions_dt = test_predictions_dt.rename(columns={"SalePrice":"DT_predictions"})

# Rename df
blended_predictions_df = test_predictions_dt

The meta-model was chosen with a simple grid search across several estimators, each left at its default hyperparameters. Since the holdout dataset is very small, I used 10 folds in the cross-validation process.

In [None]:
import warnings
warnings.filterwarnings("ignore", message=".*Int64Index.*")

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
import xgboost as xgb
import lightgbm as lgb

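# Defining a custom scorer (Root Mean Squared Error)
# greater_is_better=False makes the grid search minimize it, so reported scores are negated RMSE values
def root_mean_squared_error(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse = make_scorer(root_mean_squared_error, greater_is_better=False)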

X = holdout_predictions_df[["XGBoost_Predictions", "LightGBM_Predictions", "NN_Predictions"]]
y = holdout_predictions_df["Actual_SalePrice"]

estimators = [
    {
        "clf": (LinearRegression(),)
    },
    {
        "clf": (Ridge(),)
    },
    {
        "clf": (xgb.XGBRegressor(),)
    },
    {
       "clf": (lgb.LGBMRegressor(),)
    }
]

pipe = Pipeline([("clf", LinearRegression())])

grid_search = GridSearchCV(pipe, estimators, cv=10, scoring=rmse)
grid_search.fit(X,y)
grid_search.score(X,y)

In [None]:
grid_search.cv_results_

In [None]:
# Make predictions 
grid_search_blended_predictions = grid_search.predict(blended_predictions_df[["DT_predictions", "NN_predictions"]])

In [None]:
# Add predictions to a dataframe
grid_search_blended_predictions_df = blended_predictions_df
grid_search_blended_predictions_df["SalePrice"] = grid_search_blended_predictions
grid_search_blended_predictions_df = grid_search_blended_predictions_df.drop(columns=["DT_predictions","NN_predictions"])

# Create CSV with blended predictions
grid_search_blended_predictions_df.to_csv("submission_gridsearch_blended.csv",index=False)

In [None]:
grid_search_blended_predictions_df

I had high hopes for a meta-model that blended predictions, but it performed worse than I expected.



## Improvements to subsequent models

Below I note some improvements to the models I would implement if I had more time.

- Remove outliers (using something like sklearn's IsolationForest)
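
As a rough idea of what the outlier-removal step could look like, here is a minimal sketch using scikit-learn's `IsolationForest`. The data is synthetic, and the `contamination` value of 0.05 is an arbitrary assumption that would need tuning on the real training set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic feature matrix with a handful of obviously extreme rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
X_demo[:5] += 8.0

# contamination is the assumed share of outliers in the data
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X_demo)   # 1 = inlier, -1 = outlier

# Keep only the rows flagged as inliers
inlier_mask = labels == 1
X_clean = X_demo[inlier_mask]
print(X_demo.shape, X_clean.shape)
```

The same boolean mask would have to be applied to the target values so that features and labels stay aligned.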