# Predicting real estate prices - Model serving and deployment

## Background
Sound Realty helps people sell homes in the Seattle area.

They currently spend too much time and effort on estimating the value of properties.

One of their staff has heard a lot about machine learning (ML) and has created a basic model to estimate the value of properties.

The basic model uses only numeric variables and ignores some other attributes.
Despite the simplicity of this model, the folks at Sound are impressed with the proof of concept and would now like to use it to streamline their business.

They have contracted us to help deploy that model for broader use.
Our job is to create a REST endpoint that serves up model predictions for new data, and to provide guidance on how they could improve the model.

## Proposed Solution

Here I shall deploy the model to a REST endpoint using Modal.


## Library Installation and Import

Below I install and then import the libraries needed.

In [2]:
%pip install seaborn
%pip install tqdm
%pip install wandb
%pip install sweetviz
%pip install gradio
%pip install dash
%pip install streamlit plotly requests
%pip install modal bentoml 

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Collecting wandb
  Downloading wandb-0.22.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting sentry-sdk>=2.0.0 (from wandb)
  Downloading sentry_sdk-2.38.0-py2.py3-none-any.whl.metadata (10 kB)
Downloading wandb-0.22.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.6 MB)
Downloading sentry_sdk-2.38.0-py2.py3-none-any.whl (370 kB)
Installing collected packages: sentry-sdk, wandb

In [None]:
%uv pip install seaborn
%uv pip install tqdm
%uv pip install wandb
%uv pip install sweetviz
%uv pip install gradio
%uv pip install dash
%uv pip install streamlit plotly requests
%uv pip install modal bentoml 

### Imports

In [1]:
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
from numpy import random
from tqdm import tqdm
from ipywidgets import interact
from pathlib import Path
import os, warnings, io, getpass, json, dash, modal, bentoml, gc, wandb, pickle
from joblib import dump, load
from dash import dcc, html, dash_table
import typing as t
from bentoml.validators import DataframeSchema
np.set_printoptions(linewidth=130)
plt.rc('image', cmap='Greys')

## Exploratory Data Analysis

We have three datasets:
- **kc_house_data.csv** – Data for training the model.
- **zipcode_demographics.csv** – Additional demographic data from the U.S. Census, used as features. This data should be joined to the primary home sales data using the zipcode column.
- **future_unseen_examples.csv** – Examples of homes to be sold in the future. It includes all attributes from the original home sales file except price, date, and id. It also does not include the demographic data.
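
The zipcode join described above can be sketched with `DataFrame.merge`. The two small frames below are made-up stand-ins for the real files, and the choices of `how="left"`, `validate="many_to_one"` and `indicator=True` are assumptions for this sketch rather than anything prescribed by the dataset.

```python
import pandas as pd

# Made-up stand-ins for kc_house_data.csv and zipcode_demographics.csv
sales = pd.DataFrame({
    "bedrooms": [3, 2, 4],
    "sqft_living": [1180, 770, 1960],
    "zipcode": [98178, 98028, 99999],   # 99999 has no demographics row
})
demographics = pd.DataFrame({
    "zipcode": [98178, 98028],
    "medn_hshld_incm_amt": [47461.0, 61813.0],
})

# Left join keeps every sale; many_to_one asserts one demographics row per zipcode
joined = sales.merge(demographics, on="zipcode", how="left",
                     validate="many_to_one", indicator=True)

# Rows flagged "left_only" found no matching zipcode and carry NaN demographics
unmatched = joined.loc[joined["_merge"] == "left_only", "zipcode"].tolist()
print(joined.drop(columns="_merge"))
print("Zipcodes without demographics:", unmatched)
```

Checking for unmatched zipcodes up front matters here because the model shown later cannot accept missing values.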


Let's first take a look at our datasets.

In [10]:
path = Path('..')
path

PosixPath('..')

In [11]:
!ls ../data

future_unseen_examples.csv  kc_house_data.csv  zipcode_demographics.csv


In [12]:
train_df = pd.read_csv(path/'data/kc_house_data.csv', index_col='id')
demographics_df = pd.read_csv(path/'data/zipcode_demographics.csv')
test_df = pd.read_csv(path/'data/future_unseen_examples.csv')

In [13]:
train_df

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,3,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,3,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,3,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,3,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


In [None]:
#train_df??

In [14]:
train_df.columns

Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [15]:
test_df.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [16]:
demographics_df.columns

Index(['ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty',
       'non_farm_qty', 'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt',
       'hous_val_amt', 'edctn_less_than_9_qty', 'edctn_9_12_qty',
       'edctn_high_schl_qty', 'edctn_some_clg_qty', 'edctn_assoc_dgre_qty',
       'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn',
       'per_farm', 'per_non_farm', 'per_less_than_9', 'per_9_to_12', 'per_hsd',
       'per_some_clg', 'per_assoc', 'per_bchlr', 'per_prfsnl', 'zipcode'],
      dtype='object')

In [24]:
!ls ../model

model.pkl  model_features.json


In [26]:
import json
from pathlib import Path

# Check different possible locations
possible_paths = [
    Path("model_features.json"),  # current directory
    Path("model/model_features.json"),  # model subdirectory
    Path("../model/model_features.json"),  # parent directory
    Path("data/model_features.json"),  # data directory if you have one
]

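# Use a distinct loop variable so the global `path` (the project root) is not overwritten
for candidate in possible_paths:
    if candidate.exists():
        print(f"Found features file at: {candidate}")
        model_features = json.loads(candidate.read_text())
        print("Model features:", model_features)
        break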
else:
    print("model_features.json not found in any expected location")

Found features file at: ../model/model_features.json
Model features: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement', 'ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty', 'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt', 'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty', 'edctn_some_clg_qty', 'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm', 'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg', 'per_assoc', 'per_bchlr', 'per_prfsnl']


In [23]:
# Using pathlib (more robust)
features_path = Path("model/model_features.json")

if features_path.exists():
    print(f"Loading model features from {features_path}")
    model_features = json.loads(features_path.read_text())
    print("Model features:", model_features)
    print(f"Number of features: {len(model_features)}")
else:
    print(f"Features file not found at {features_path}")

Features file not found at model/model_features.json


In [37]:
len(train_df.columns),len(test_df.columns)

(20, 18)

In [23]:
demographics_df

,ppltn_qty,urbn_ppltn_qty,sbrbn_ppltn_qty,farm_ppltn_qty,non_farm_qty,medn_hshld_incm_amt,medn_incm_per_prsn_amt,hous_val_amt,edctn_less_than_9_qty,edctn_9_12_qty,...,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl,zipcode
0,38249.0,37394.0,0.0,0.0,855.0,66051.0,25219.0,192000.0,437.0,2301.0,...,0.0,2.0,1.0,6.0,18.0,20.0,5.0,12.0,4.0,98042
1,22036.0,22036.0,0.0,0.0,0.0,91904.0,53799.0,573900.0,149.0,404.0,...,0.0,0.0,0.0,1.0,6.0,12.0,3.0,27.0,22.0,98040
2,18194.0,18194.0,0.0,0.0,0.0,61813.0,31765.0,246600.0,269.0,905.0,...,0.0,0.0,1.0,4.0,13.0,20.0,6.0,19.0,9.0,98028
3,21956.0,21956.0,0.0,0.0,0.0,47461.0,22158.0,175400.0,925.0,1773.0,...,0.0,0.0,4.0,8.0,20.0,21.0,5.0,12.0,4.0,98178
4,22814.0,22814.0,0.0,0.0,0.0,48606.0,28398.0,252600.0,599.0,1148.0,...,0.0,0.0,2.0,5.0,13.0,17.0,5.0,23.0,12.0,98007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,35140.0,35021.0,0.0,0.0,119.0,81929.0,41856.0,335900.0,212.0,865.0,...,0.0,0.0,0.0,2.0,8.0,15.0,4.0,27.0,15.0,98006
66,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98074
67,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98077
68,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98030


In [24]:
test_df

,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,4,1.00,1680,5043,1.5,0,0,4,6,1680,0,1911,0,98118,47.5354,-122.273,1560,5765
1,3,2.50,2220,6380,1.5,0,0,4,8,1660,560,1931,0,98115,47.6974,-122.313,950,6380
2,3,2.25,1630,10962,1.0,0,0,4,8,1100,530,1977,0,98030,47.3801,-122.166,1830,8470
3,5,2.50,1710,9720,2.0,0,0,4,8,1710,0,1974,0,98005,47.5903,-122.157,2270,9672
4,2,1.00,850,6370,1.0,0,0,3,6,850,0,1951,0,98126,47.5198,-122.373,850,5170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,0,0,3,10,2430,0,1987,0,98027,47.4664,-121.992,2910,49658
96,2,2.50,1240,1249,3.0,0,0,3,8,1240,0,2006,0,98107,47.6718,-122.386,1240,2500
97,4,1.75,1860,9750,1.0,0,0,3,7,1460,400,1969,0,98034,47.7097,-122.202,1900,8913
98,5,1.75,2330,3800,1.5,0,0,3,7,1360,970,1927,0,98115,47.6835,-122.308,2100,3800


In [26]:
!ls model

model.pkl  model_features.json


In [27]:
model_path = path/'model/model.pkl'

In [30]:
if os.path.exists(model_path):
    print(f"Loading existing model from pickle at {model_path}")
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    print(model)

Loading existing model from pickle at model/model.pkl
Pipeline(steps=[('robustscaler', RobustScaler()),
                ('kneighborsregressor', KNeighborsRegressor())])


In [31]:
def load_model(model_path="model/model.pkl"):
    """
    Function to load the model

    Args
    model_path: the path to the model
    """
    if os.path.exists(model_path):
        print(f"Loading existing model from pickle at {model_path}")
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        return model
    else:
        print(f"Model file not found at {model_path}")
        return None

In [32]:
load_model

<function __main__.load_model(model_path='model/model.pkl')>

In [34]:
load_model??

Signature: load_model(model_path='model/model.pkl')
Docstring: <no docstring>
Source:   
def load_model(model_path="model/model.pkl"):
    if os.path.exists(model_path):
        print(f"Loading existing model from pickle at {model_path}")
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        return model

In [35]:
model = load_model()

Loading existing model from pickle at model/model.pkl


In [36]:
model.predict(test_df)

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- condition
- grade
- lat
- long
- sqft_living15
- ...
Feature names seen at fit time, yet now missing:
- edctn_9_12_qty
- edctn_assoc_dgre_qty
- edctn_bchlr_dgre_qty
- edctn_high_schl_qty
- edctn_less_than_9_qty
- ...
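
The error says the columns passed to `predict` differ from those seen at fit time, but the traceback truncates both lists with `...`. A quick way to see the full difference is to compare the expected feature list against the frame's columns. This is a small sketch with shortened, illustrative data: `expected` stands in for `model_features` and `frame` for `test_df`.

```python
import pandas as pd

# Shortened stand-ins for model_features and test_df
expected = ["bedrooms", "sqft_living", "ppltn_qty", "per_bchlr"]
frame = pd.DataFrame({"bedrooms": [4], "sqft_living": [1680],
                      "grade": [6], "zipcode": [98118]})

missing = [c for c in expected if c not in frame.columns]   # needed but absent
extra = [c for c in frame.columns if c not in expected]     # present but unused

print("Missing from input:", missing)
print("Not used by model:", extra)
```

In this notebook the missing names are all demographic columns, which suggests the input needs the zipcode join before prediction.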


In [31]:
# inference_preprocessor.py
import pandas as pd
import numpy as np
import joblib
import json

class InferencePreprocessor:
    def __init__(self, model_features, medians=None):
        self.model_features = model_features
        self.medians = medians or {}

    def _normalize_zip(self, series):
        return series.astype(str).str.zfill(5).str[:5]

    def fit_medians(self, df):
        """Compute medians (from training merge). Run this ONCE during training."""
        self.medians = {feat: float(df[feat].median(skipna=True)) for feat in self.model_features}
        return self

    def transform(self, df, demo_df):
        """Transform new data for model.predict"""
        df = df.copy()
        demo_df = demo_df.copy()

        df["zipcode"] = self._normalize_zip(df["zipcode"])
        demo_df["zipcode"] = self._normalize_zip(demo_df["zipcode"])

        merged = pd.merge(df, demo_df, on="zipcode", how="left")

        # Ensure all required features exist
        for feat in self.model_features:
            if feat not in merged.columns:
                merged[feat] = np.nan

        X = merged[self.model_features].astype(float)

        # Fill NAs using training medians (if available)
        if self.medians:
            X = X.fillna(pd.Series(self.medians))
        else:
            # fallback: compute medians from this batch
            X = X.fillna(X.median())

        return X

    def save(self, path):
        joblib.dump({"medians": self.medians, "features": self.model_features}, path)

    @classmethod
    def load(cls, path):
        obj = joblib.load(path)
        return cls(obj["features"], obj["medians"])


# =====================
# Example usage
# =====================
if __name__ == "__main__":
    # Load model + features
    with open("../model/model_features.json") as f:
        model_features = json.load(f)

    model = joblib.load("../model/model.pkl")
    demo = pd.read_csv("../data/zipcode_demographics.csv")
    test = pd.read_csv("../data/future_unseen_examples.csv")

    # Load preprocessor (must already have been fitted and saved during training)
    pre = InferencePreprocessor.load("../artifacts/preprocessor.joblib")

    # Prepare features for prediction
    X_test = pre.transform(test, demo)

    # Predict
    preds = model.predict(X_test)
    test["predicted_price"] = preds
    test.to_csv("predictions.csv", index=False)
    print("Predictions written to predictions.csv")


FileNotFoundError: [Errno 2] No such file or directory: '../artifacts/preprocessor.joblib'
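
The `FileNotFoundError` is raised because nothing has written the preprocessor artifact yet: `load` expects a file that `fit_medians` followed by `save` would produce. A minimal sketch of that missing step follows, assuming the medians are computed on the training sales joined with the demographics. It uses tiny made-up frames and a temporary directory, and it writes the same `{"medians", "features"}` dictionary layout that `save` uses. The file name is illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd

# Made-up stand-ins for the training sales and the demographics table
train = pd.DataFrame({"sqft_living": [1000, 2000, 3000],
                      "zipcode": [98178, 98028, 98178]})
demo = pd.DataFrame({"zipcode": [98178, 98028],
                     "ppltn_qty": [21956.0, 18194.0]})
features = ["sqft_living", "ppltn_qty"]

# Compute per-feature medians on the merged training data
merged = train.merge(demo, on="zipcode", how="left")
medians = {feat: float(merged[feat].median(skipna=True)) for feat in features}

# Persist in the same dictionary layout that InferencePreprocessor.save writes
artifact = Path(tempfile.mkdtemp()) / "preprocessor.joblib"
joblib.dump({"medians": medians, "features": features}, artifact)

# Round-trip check: the artifact can be read back
restored = joblib.load(artifact)
print(restored["medians"])
```

With the class above, the equivalent would be calling `fit_medians` on the merged training frame and then `save` with the artifact path, before any call to `load`.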

In [40]:
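# Note: pd.concat stacks the two frames row-wise rather than joining them on zipcode,
# so each resulting row is missing either the house columns or the demographic columns.
new_test_df = pd.concat([test_df, demographics_df], ignore_index=True)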
new_test_df

,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,...,per_sbrbn,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl
0,4.0,1.00,1680.0,5043.0,1.5,0.0,0.0,4.0,6.0,1680.0,...,,,,,,,,,,
1,3.0,2.50,2220.0,6380.0,1.5,0.0,0.0,4.0,8.0,1660.0,...,,,,,,,,,,
2,3.0,2.25,1630.0,10962.0,1.0,0.0,0.0,4.0,8.0,1100.0,...,,,,,,,,,,
3,5.0,2.50,1710.0,9720.0,2.0,0.0,0.0,4.0,8.0,1710.0,...,,,,,,,,,,
4,2.0,1.00,850.0,6370.0,1.0,0.0,0.0,3.0,6.0,850.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,,,,,,,,,,,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,27.0,15.0
166,,,,,,,,,,,...,0.0,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5
167,,,,,,,,,,,...,0.0,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5
168,,,,,,,,,,,...,0.0,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5


In [41]:
# Using pathlib (more robust)
features_path = Path("model/model_features.json")

if features_path.exists():
    print(f"Loading model features from {features_path}")
    model_features = json.loads(features_path.read_text())
    print("Model features:", model_features)
    print(f"Number of features: {len(model_features)}")
else:
    print(f"Features file not found at {features_path}")

Loading model features from model/model_features.json
Model features: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement', 'ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty', 'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt', 'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty', 'edctn_some_clg_qty', 'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm', 'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg', 'per_assoc', 'per_bchlr', 'per_prfsnl']
Number of features: 33


In [42]:
# Assuming model_features is a list like ['feature1', 'feature2', 'feature3', ...]

# When preparing data for prediction:
def prepare_prediction_data(df, required_features):
    """Ensure DataFrame has required features in correct order"""
    
    # Fail early with a clear message if any required feature is missing
    missing_features = set(required_features) - set(df.columns)
    if missing_features:
        raise KeyError(f"Missing features: {sorted(missing_features)}")
    
    # Select and reorder columns to match training data
    df_prepared = df[required_features]
    print(f"Data prepared with {len(required_features)} features in correct order")
    
    return df_prepared



In [45]:
# Example usage
prediction_data = prepare_prediction_data(new_test_df, model_features)
predictions = model.predict(prediction_data)

Data prepared with 33 features in correct order


ValueError: Input X contains NaN.
KNeighborsRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
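
The NaNs come from how `new_test_df` was built: `pd.concat` with its default `axis=0` stacks the demographics rows underneath the house rows, so no row has both sets of columns. Joining on `zipcode` instead keeps one row per house. The sketch below contrasts the two on made-up data; the column subset and values are illustrative only.

```python
import pandas as pd

houses = pd.DataFrame({"bedrooms": [4, 3], "zipcode": [98118, 98115]})
demo = pd.DataFrame({"zipcode": [98118, 98115], "ppltn_qty": [40000.0, 45000.0]})
features = ["bedrooms", "ppltn_qty"]

# Row-wise concat: 2 + 2 rows, each missing one side's columns
stacked = pd.concat([houses, demo], ignore_index=True)

# Key-based join: one row per house with its zipcode's demographics attached
merged = houses.merge(demo, on="zipcode", how="left")

print("concat:", stacked.shape, "NaNs:", int(stacked[features].isna().sum().sum()))
print("merge: ", merged.shape, "NaNs:", int(merged[features].isna().sum().sum()))
```

Once the merged frame has no NaNs in the model's features, it should be possible to pass it through `prepare_prediction_data` and on to `model.predict`.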