# House Price Prediction Using Keras/TensorFlow

This notebook builds a neural network to predict house prices. In addition to using the mean squared error loss, we add a differentiable KL divergence penalty (based on a soft histogram of the predictions and targets) to encourage the model’s output distribution to match that of the actual SalePrice. The notebook also plots training and validation losses and overlays the distribution histograms of actual and predicted prices.

In [20]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

%matplotlib inline
print('TensorFlow version:', tf.__version__)

TensorFlow version: 2.18.0


## 1. Load the Data

Replace the placeholder paths with your actual file paths for `train.csv` and `test.csv`.

In [21]:
# Replace these paths with the actual file paths
train_path = 'train.csv'
test_path = 'test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

print('Train shape:', train_df.shape)
print('Test shape:', test_df.shape)

Train shape: (1000, 81)
Test shape: (460, 80)


## 2. Data Preprocessing

Fill missing values (numeric with median, categorical with mode) and one-hot encode categorical features. Also drop the `Id` column and separate out the target variable `SalePrice`.

In [22]:
def fill_missing_values(df):
    # Fill numeric columns with median
    num_cols = df.select_dtypes(include=['int64', 'float64']).columns
    for col in num_cols:
        df[col].fillna(df[col].median(), inplace=True)
    
    # Fill categorical columns with mode
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        df[col].fillna(df[col].mode()[0], inplace=True)
    return df

# Process training data
train_df = fill_missing_values(train_df.copy())

# Save and drop the Id column
train_ids = train_df['Id']
train_df.drop('Id', axis=1, inplace=True)

# Separate target variable and features
y = train_df['SalePrice']
# Instead of this:
# y = train_df['SalePrice']
# Use:
y = np.log(train_df['SalePrice'])

X = train_df.drop('SalePrice', axis=1)

# One-hot encode categorical features
X = pd.get_dummies(X, drop_first=True)

print('Processed training features shape:', X.shape)

Processed training features shape: (1000, 230)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(df[col].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(df[col].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values

## 2.1 Feature Selection

Compute the correlation matrix (using only numeric columns) and select the top five features (by absolute correlation with SalePrice).

In [23]:
# Compute correlation matrix using only numeric columns
numeric_df = train_df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()

# Get top 5 features (excluding SalePrice itself)
top_features = corr_matrix['SalePrice'].abs().sort_values(ascending=False).iloc[1:6].index.tolist()
print('Top 5 features selected:', top_features)

# Keep only these features in X
X = X[top_features]
print('X shape after feature selection:', X.shape)

Top 5 features selected: ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF']
X shape after feature selection: (1000, 5)


### 2.2 Train-Validation Split and Scaling

Split the data (80% train, 20% validation) and scale the features.

In [24]:
# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

# Fit the scaler on training features and transform both train and validation sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

print('X_train_scaled shape:', X_train_scaled.shape)
print('X_val_scaled shape:', X_val_scaled.shape)

X_train_scaled shape: (800, 5)
X_val_scaled shape: (200, 5)


## 3. Define a Combined Loss with KL Divergence

We create differentiable (soft) histograms and compute the KL divergence between the target and prediction distributions. The combined loss is MSE plus an alpha-weighted KL divergence.

In [25]:
def soft_histogram(x, bin_centers, sigma):
    # x is a 1D tensor; reshape to (batch, 1)
    x = tf.reshape(x, [-1, 1])
    diff = x - bin_centers  # (batch, num_bins)
    soft_counts = tf.exp(-tf.square(diff) / (2.0 * sigma**2))
    hist = tf.reduce_sum(soft_counts, axis=0)
    hist = hist / tf.reduce_sum(hist)
    return hist

def kl_divergence_loss(y_true, y_pred, num_bins=50, sigma=1.0):
    # Use the range of y_true for binning; if targets are very large or skewed, consider log-transforming y_true and y_pred first.
    min_val = tf.reduce_min(y_true)
    max_val = tf.reduce_max(y_true)
    bin_centers = tf.linspace(min_val, max_val, num_bins)
    
    hist_true = soft_histogram(y_true, bin_centers, sigma)
    hist_pred = soft_histogram(y_pred, bin_centers, sigma)
    
    # Increase epsilon for clipping to avoid extreme small values
    epsilon = 1e-6
    hist_true = tf.clip_by_value(hist_true, epsilon, 1.0)
    hist_pred = tf.clip_by_value(hist_pred, epsilon, 1.0)
    
    kl_loss = tf.reduce_sum(hist_true * tf.math.log(hist_true / hist_pred))
    
    # Debug print to check intermediate values
    tf.print("KL loss:", kl_loss, "hist_true:", hist_true, "hist_pred:", hist_pred)
    tf.debugging.check_numerics(kl_loss, message="KL divergence loss produced NaN")
    
    return kl_loss

def combined_loss(y_true, y_pred, alpha=0.001):  # Try a very low alpha
    mse_loss = tf.reduce_mean(tf.square(y_true - y_pred))
    kl_loss = kl_divergence_loss(y_true, y_pred, num_bins=50, sigma=1.0)
    total_loss = mse_loss + alpha * kl_loss
    tf.debugging.check_numerics(total_loss, message="Combined loss produced NaN")
    return total_loss

# Optionally, if SalePrice is very skewed, try a log transform:
# y = np.log(train_df['SalePrice'])
# (and then later use np.exp on the predictions)

# Compile the model with the new loss function
model.compile(optimizer='adam', loss=lambda y_true, y_pred: combined_loss(y_true, y_pred, alpha=0.001))


## 4. Train the Model

Train the model using the combined loss function. Training and validation losses are recorded.

In [26]:
history = model.fit(X_train_scaled, y_train,
                    epochs=100,
                    batch_size=32,
                    validation_data=(X_val_scaled, y_val),
                    verbose=1)

Epoch 1/100
KL loss: -nan hist_true: [0.0092057595 0.00972897094 0.0102670277 ... 0.0219340902 0.0214404203 0.0209200922] hist_pred: [-nan -nan -nan ... -nan -nan -nan]


2025-02-25 04:07:11.355329: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: KL divergence loss produced NaN : Tensor had NaN values
	 [[{{function_node __inference_one_step_on_data_36358}}{{node compile_loss/lambda/CheckNumerics}}]]


InvalidArgumentError: Graph execution error:

Detected at node compile_loss/lambda/CheckNumerics defined at (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/codespace/.local/lib/python3.12/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/codespace/.local/lib/python3.12/site-packages/tornado/platform/asyncio.py", line 205, in start

  File "/usr/local/python/3.12.1/lib/python3.12/asyncio/base_events.py", line 638, in run_forever

  File "/usr/local/python/3.12.1/lib/python3.12/asyncio/base_events.py", line 1971, in _run_once

  File "/usr/local/python/3.12.1/lib/python3.12/asyncio/events.py", line 84, in _run

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/codespace/.local/lib/python3.12/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/codespace/.local/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/codespace/.local/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/codespace/.local/lib/python3.12/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/codespace/.local/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/codespace/.local/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/codespace/.local/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_1848/3275033335.py", line 1, in <module>

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 371, in fit

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 113, in one_step_on_data

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/backend/tensorflow/trainer.py", line 60, in train_step

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/trainers/trainer.py", line 383, in _compute_loss

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/trainers/trainer.py", line 351, in compute_loss

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/trainers/compile_utils.py", line 691, in __call__

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/trainers/compile_utils.py", line 700, in call

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/losses/loss.py", line 67, in __call__

  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/keras/src/losses/losses.py", line 33, in call

  File "/tmp/ipykernel_1848/3066061520.py", line 44, in <lambda>

  File "/tmp/ipykernel_1848/3066061520.py", line 34, in combined_loss

  File "/tmp/ipykernel_1848/3066061520.py", line 28, in kl_divergence_loss

KL divergence loss produced NaN : Tensor had NaN values
	 [[{{node compile_loss/lambda/CheckNumerics}}]] [Op:__inference_multi_step_on_iterator_36410]

## 5. Evaluate the Model and Visualize Performance

Compute the RMSE on the validation set, plot training/validation loss curves, and overlay the distribution histograms of actual vs. predicted prices.

In [27]:
# Evaluate on validation data
y_val_pred = model.predict(X_val_scaled).flatten()
print("Any NaNs in predictions?", np.any(np.isnan(y_val_pred)))
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print('Validation RMSE:', rmse)



# Plot training and validation loss
plt.figure(figsize=(8, 6))
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Combined Loss')
plt.legend()
plt.title('Training and Validation Loss')
plt.show()

# Plot distribution histograms (validation set)
plt.figure(figsize=(8, 4))
sns.histplot(y_val, bins=50, color='blue', alpha=0.5, stat='density', label='Actual')
sns.histplot(y_val_pred, bins=50, color='red', alpha=0.5, stat='density', label='Predicted')
plt.xlabel('SalePrice')
plt.ylabel('Density')
plt.title('Distribution of Actual vs Predicted SalePrice (Validation)')
plt.legend()
plt.show()

[1m1/7[0m [32m━━[0m[37m━━━━━━━━━━━━━━━━━━[0m [1m0s[0m 37ms/step

[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step 
Any NaNs in predictions? True


ValueError: Input contains NaN.

## 6. Prepare the Test Data and Save Predictions

Process the test data (using the same selected features and scaling), predict house prices, and save the results.

In [None]:
# Process test data
test_df = fill_missing_values(test_df.copy())
test_ids = test_df['Id']
test_df.drop('Id', axis=1, inplace=True)
test_df = pd.get_dummies(test_df, drop_first=True)
test_df = test_df.reindex(columns=X.columns, fill_value=0)
test_scaled = scaler.transform(test_df)

# Predict on test data
test_predictions = model.predict(test_scaled).flatten()

# Save predictions
predictions_df = pd.DataFrame({
    'ID': test_ids,
    'SALEPRICE': test_predictions
})
predictions_csv_path = predictions_keras_KL.csv'
predictions_df.to_csv(predictions_csv_path, index=False)
print('Predictions saved to', predictions_csv_path)