# Predicting real estate prices - Model serving and deployment

## Background
Sound Realty helps people sell homes in the Seattle area.

They currently spend too much time and effort on estimating the value of properties.

One of their staff has heard a lot about machine learning (ML) and has created a basic model to estimate the value of properties.

The basic model uses only numeric variables and ignores some other attributes.
Despite the simplicity of this model, the folks at Sound are impressed with the proof of concept and would now like to use this model to streamline
their business.

They have contracted us to help deploy that model for broader use.
Our job is to create a REST endpoint that serves up model predictions for new data, and to provide guidance on how they could improve the model.

## Proposed Solution

Here I shall deploy the model to a REST endpoint using Modal.


## Library Installation and Import

Below I shall install then import the libraries needed.

In [2]:
!pip install uv

Collecting uv
  Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
Installing collected packages: uv
Successfully installed uv-0.8.19


In [80]:
#%pip install seaborn tqdm sweetviz dash streamlit plotly requests gradio joblib scikit-learn ipywidgets modal bentoml wandb

In [1]:
%uv pip install seaborn tqdm sweetviz dash streamlit plotly requests gradio
%uv pip install joblib scikit-learn ipywidgets modal bentoml wandb

Using Python 3.10.12 environment at: /usr
Resolved 96 packages in 2.68s
Prepared 6 packages in 24.58s
error: Failed to install: pyparsing-3.2.5-py3-none-any.whl (pyparsing==3.2.5)
  Caused by: failed to create directory `/usr/local/lib/python3.10/dist-packages/pyparsing-3.2.5.dist-info`: Permission denied (os error 13)
Note: you may need to restart the kernel to use updated packages.
Using Python 3.10.12 environment at: /usr
Resolved 117 packages in 2.09s
Prepared 5 packages in 3.45s
error: Failed to install: threadpoolctl-3.6.0-py3-none-any.whl (threadpoolctl==3.6.0)
  Caused by: failed to create directory `/usr/loc

### Imports

In [57]:
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
from numpy import random
from tqdm import tqdm
from ipywidgets import interact
from pathlib import Path
import os, warnings, io, getpass, json, dash, modal, bentoml, gc, wandb, pickle, boto3, shutil
from joblib import dump, load
from dash import dcc, html, dash_table
import typing as t
from bentoml.validators import DataframeSchema
from fastapi import File, UploadFile, Form, HTTPException
np.set_printoptions(linewidth=130)
plt.rc('image', cmap='Greys')
import sys
from modal import App, Volume, Image
import requests

## Exploratory Data Analysis

We have three datasets:
- **kc_house_data.csv**: Home sales data used to train the model.
- **zipcode_demographics.csv**: Additional demographic data from the U.S. Census, used as features. This data should be joined to the primary home sales data on the `zipcode` column.
- **future_unseen_examples.csv**: Examples of homes to be sold in the future. It includes all attributes from the original home sales file except `price`, `date`, and `id`. It also does not include the demographic data.
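
The zipcode join described in the list can be sketched on toy data. The two frames and all values below are invented for illustration; only the `zipcode` key and the column names `sqft_living` and `medn_hshld_incm_amt` come from the real files.

```python
import pandas as pd

# Toy stand-ins for the real files; all values are invented for illustration.
homes = pd.DataFrame({
    "zipcode": [98118, 98115, 98115],
    "sqft_living": [1680, 2220, 2330],
})
demographics = pd.DataFrame({
    "zipcode": ["98118", "98115"],
    "medn_hshld_incm_amt": [50000.0, 70000.0],
})

# Align the key dtype on both sides before merging.
homes["zipcode"] = homes["zipcode"].astype(str)
demographics["zipcode"] = demographics["zipcode"].astype(str)

# A left join keeps every home. validate= raises if a zipcode appears more
# than once in the demographics table; indicator= adds a `_merge` column
# that flags homes without a demographic match.
joined = homes.merge(
    demographics,
    on="zipcode",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = joined[joined["_merge"] == "left_only"]
print(joined.shape, len(unmatched))
```

Checking `unmatched` before prediction is one way to notice zipcodes that have no demographic row, which would otherwise show up later as missing feature values.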


Let's first take a look at our datasets.

In [2]:
path = Path('..')
path

PosixPath('..')

In [3]:
!ls ../data

future_unseen_examples.csv  kc_house_data.csv  zipcode_demographics.csv


In [4]:
train_df = pd.read_csv(path/'data/kc_house_data.csv', index_col='id')
demographics_df = pd.read_csv(path/'data/zipcode_demographics.csv')
test_df = pd.read_csv(path/'data/future_unseen_examples.csv')

In [5]:
train_df

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,3,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,3,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,3,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,3,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


In [6]:
#train_df??

In [7]:
train_df.columns

Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [8]:
test_df.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [9]:
demographics_df.columns

Index(['ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty',
       'non_farm_qty', 'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt',
       'hous_val_amt', 'edctn_less_than_9_qty', 'edctn_9_12_qty',
       'edctn_high_schl_qty', 'edctn_some_clg_qty', 'edctn_assoc_dgre_qty',
       'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn',
       'per_farm', 'per_non_farm', 'per_less_than_9', 'per_9_to_12', 'per_hsd',
       'per_some_clg', 'per_assoc', 'per_bchlr', 'per_prfsnl', 'zipcode'],
      dtype='object')

In [10]:
!ls ../model

model.pkl  model_features.json


In [11]:
import json
from pathlib import Path

# Check different possible locations
possible_paths = [
    Path("model_features.json"),  # current directory
    Path("model/model_features.json"),  # model subdirectory
    Path("../model/model_features.json"),  # parent directory
    Path("data/model_features.json"),  # data directory if you have one
]

for candidate in possible_paths:  # separate name, so the notebook-level `path` is not overwritten
    if candidate.exists():
        print(f"Found features file at: {candidate}")
        model_features = json.loads(candidate.read_text())
        print("Model features:", model_features)
        break
else:
    print("model_features.json not found in any expected location")

Found features file at: ../model/model_features.json
Model features: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement', 'ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty', 'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt', 'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty', 'edctn_some_clg_qty', 'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm', 'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg', 'per_assoc', 'per_bchlr', 'per_prfsnl']


In [12]:
len(train_df.columns),len(test_df.columns)

(20, 18)

In [13]:
demographics_df

Unnamed: 0,ppltn_qty,urbn_ppltn_qty,sbrbn_ppltn_qty,farm_ppltn_qty,non_farm_qty,medn_hshld_incm_amt,medn_incm_per_prsn_amt,hous_val_amt,edctn_less_than_9_qty,edctn_9_12_qty,...,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl,zipcode
0,38249.0,37394.0,0.0,0.0,855.0,66051.0,25219.0,192000.0,437.0,2301.0,...,0.0,2.0,1.0,6.0,18.0,20.0,5.0,12.0,4.0,98042
1,22036.0,22036.0,0.0,0.0,0.0,91904.0,53799.0,573900.0,149.0,404.0,...,0.0,0.0,0.0,1.0,6.0,12.0,3.0,27.0,22.0,98040
2,18194.0,18194.0,0.0,0.0,0.0,61813.0,31765.0,246600.0,269.0,905.0,...,0.0,0.0,1.0,4.0,13.0,20.0,6.0,19.0,9.0,98028
3,21956.0,21956.0,0.0,0.0,0.0,47461.0,22158.0,175400.0,925.0,1773.0,...,0.0,0.0,4.0,8.0,20.0,21.0,5.0,12.0,4.0,98178
4,22814.0,22814.0,0.0,0.0,0.0,48606.0,28398.0,252600.0,599.0,1148.0,...,0.0,0.0,2.0,5.0,13.0,17.0,5.0,23.0,12.0,98007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,35140.0,35021.0,0.0,0.0,119.0,81929.0,41856.0,335900.0,212.0,865.0,...,0.0,0.0,0.0,2.0,8.0,15.0,4.0,27.0,15.0,98006
66,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98074
67,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98077
68,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98030


In [14]:
test_df

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,4,1.00,1680,5043,1.5,0,0,4,6,1680,0,1911,0,98118,47.5354,-122.273,1560,5765
1,3,2.50,2220,6380,1.5,0,0,4,8,1660,560,1931,0,98115,47.6974,-122.313,950,6380
2,3,2.25,1630,10962,1.0,0,0,4,8,1100,530,1977,0,98030,47.3801,-122.166,1830,8470
3,5,2.50,1710,9720,2.0,0,0,4,8,1710,0,1974,0,98005,47.5903,-122.157,2270,9672
4,2,1.00,850,6370,1.0,0,0,3,6,850,0,1951,0,98126,47.5198,-122.373,850,5170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,0,0,3,10,2430,0,1987,0,98027,47.4664,-121.992,2910,49658
96,2,2.50,1240,1249,3.0,0,0,3,8,1240,0,2006,0,98107,47.6718,-122.386,1240,2500
97,4,1.75,1860,9750,1.0,0,0,3,7,1460,400,1969,0,98034,47.7097,-122.202,1900,8913
98,5,1.75,2330,3800,1.5,0,0,3,7,1360,970,1927,0,98115,47.6835,-122.308,2100,3800


In [15]:
!ls model

ls: cannot access 'model': No such file or directory


In [16]:
model_path = path/'model/model.pkl'

In [17]:
def load_model(model_path="../model/model.pkl"):
    """
    Function to load the model

    Args
    model_path: the path to the model
    """
    if os.path.exists(model_path):
        print(f"Loading existing model from pickle at {model_path}")
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        return model
    else:
        print(f"Model file not found at {model_path}")
        return None
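
Because `load_model` returns `None` when the file is missing, a wrong path only surfaces later, when `predict` is called on `None`. A minimal sketch of a stricter variant follows; `load_model_strict` is a hypothetical name that is not part of the original notebook, and the pickled dictionary is a stand-in for a real model.

```python
import pickle
import tempfile
from pathlib import Path


def load_model_strict(model_path):
    """Load a pickled object, raising if the file is missing.

    Hypothetical variant of `load_model`, shown for illustration only.
    """
    model_path = Path(model_path)
    if not model_path.is_file():
        raise FileNotFoundError(f"Model file not found at {model_path}")
    with model_path.open("rb") as f:
        return pickle.load(f)


# Round-trip a stand-in object through a temporary pickle file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model.pkl"
    demo_path.write_bytes(pickle.dumps({"n_neighbors": 5}))
    loaded = load_model_strict(demo_path)

    # A missing file now fails immediately instead of returning None.
    try:
        load_model_strict(Path(tmp) / "missing.pkl")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Either way, pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.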

In [18]:
load_model

<function __main__.load_model(model_path='../model/model.pkl')>

In [19]:
load_model??

Signature: load_model(model_path='../model/model.pkl')
Source:   
def load_model(model_path="../model/model.pkl"):
    """
    Function to load the model

    Args
    model_path: the path to the model
    """
    if os.path.exists(model_path):
        print(f"Loading existing model from pickle at {model_path}")
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        return model
    else:
        print(f"Model file not found at {model_path}")
        return None
File:      /tmp/ipykernel_111128/2359825133.py
Type:      function

In [20]:
model = load_model()
model

Loading existing model from pickle at ../model/model.pkl


Pipeline parameters
parameter,value
steps,"[('robustscaler', ...), ('kneighborsregressor', ...)]"
transform_input,
memory,
verbose,False

robustscaler parameters
parameter,value
with_centering,True
with_scaling,True
quantile_range,"(25.0, ...)"
copy,True
unit_variance,False

kneighborsregressor parameters
parameter,value
n_neighbors,5
weights,'uniform'
algorithm,'auto'
leaf_size,30
p,2
metric,'minkowski'
metric_params,
n_jobs,
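
To make the two pipeline steps concrete, here is a rough NumPy sketch of what robust scaling followed by a 5-nearest-neighbour regression with uniform weights computes. The data is random and invented, and the 25th to 75th percentile range is an assumption, because the upper bound of `quantile_range` is elided in the output above. This is not the fitted model, only an illustration of the technique.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))            # invented features
y_train = rng.uniform(200_000, 900_000, 20)   # invented prices

# Robust scaling: subtract the per-column median, divide by the
# interquartile range (assumed here to be the 25th to 75th percentiles).
median = np.median(X_train, axis=0)
q25, q75 = np.percentile(X_train, [25, 75], axis=0)
iqr = q75 - q25


def scale(X):
    return (X - median) / iqr


def knn_predict(x_new, k=5):
    # Euclidean distance (Minkowski with p=2) in the scaled feature space.
    dists = np.linalg.norm(scale(X_train) - scale(x_new), axis=1)
    nearest = np.argsort(dists)[:k]
    # Uniform weights: the prediction is the plain mean of the k targets.
    return y_train[nearest].mean()


pred = knn_predict(X_train[0])
print(pred)
```

One consequence of this design is that a prediction is always an average of training prices, so the model cannot predict a value outside the range of prices it was trained on.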


In [21]:
#model.predict(test_df)

## Trial solution 1 - ChatGPT

In [22]:
import pandas as pd
import numpy as np

# --------------------------
# Example: model_expected list
# --------------------------
model_expected = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
    'sqft_above', 'sqft_basement',
    'ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty',
    'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt',
    'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty', 'edctn_some_clg_qty',
    'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty',
    'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm',
    'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg',
    'per_assoc', 'per_bchlr', 'per_prfsnl'
]

# --------------------------
# 1. Merge housing with demographics
# --------------------------
# Ensure zipcodes are the same dtype (string recommended to preserve leading zeros)
test_df['zipcode'] = test_df['zipcode'].astype(str)
demographics_df['zipcode'] = demographics_df['zipcode'].astype(str)

# Merge on zipcode
inference_df = test_df.merge(demographics_df, on='zipcode', how='left')
inference_df

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,...,per_sbrbn,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl
0,4,1.00,1680,5043,1.5,0,0,4,6,1680,...,0.0,0.0,0.0,9.0,9.0,17.0,15.0,4.0,11.0,6.0
1,3,2.50,2220,6380,1.5,0,0,4,8,1660,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0
2,3,2.25,1630,10962,1.0,0,0,4,8,1100,...,0.0,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5
3,5,2.50,1710,9720,2.0,0,0,4,8,1710,...,0.0,0.0,0.0,2.0,3.0,10.0,17.0,4.0,26.0,16.0
4,2,1.00,850,6370,1.0,0,0,3,6,850,...,0.0,0.0,0.0,4.0,7.0,16.0,19.0,5.0,16.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,0,0,3,10,2430,...,0.0,0.0,19.0,1.0,3.0,12.0,17.0,6.0,24.0,11.0
96,2,2.50,1240,1249,3.0,0,0,3,8,1240,...,0.0,0.0,0.0,1.0,5.0,14.0,20.0,5.0,28.0,11.0
97,4,1.75,1860,9750,1.0,0,0,3,7,1460,...,0.0,0.0,0.0,1.0,4.0,14.0,21.0,6.0,20.0,8.0
98,5,1.75,2330,3800,1.5,0,0,3,7,1360,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0


In [23]:
test_df.shape,demographics_df.shape,inference_df.shape

((100, 18), (70, 27), (100, 44))

In [24]:
inference_df.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15', 'ppltn_qty', 'urbn_ppltn_qty',
       'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty',
       'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt',
       'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty',
       'edctn_some_clg_qty', 'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty',
       'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm',
       'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg',
       'per_assoc', 'per_bchlr', 'per_prfsnl'],
      dtype='object')

In [25]:

# --------------------------
# 2. Check coverage of expected features
# --------------------------
missing_features = set(model_expected) - set(inference_df.columns)
if missing_features:
    print("WARNING: These expected features are missing after merge:", missing_features)
else:
    print("All expected features are present after merge.")


All expected features are present after merge.


In [26]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   bedrooms       100 non-null    int64  
 1   bathrooms      100 non-null    float64
 2   sqft_living    100 non-null    int64  
 3   sqft_lot       100 non-null    int64  
 4   floors         100 non-null    float64
 5   waterfront     100 non-null    int64  
 6   view           100 non-null    int64  
 7   condition      100 non-null    int64  
 8   grade          100 non-null    int64  
 9   sqft_above     100 non-null    int64  
 10  sqft_basement  100 non-null    int64  
 11  yr_built       100 non-null    int64  
 12  yr_renovated   100 non-null    int64  
 13  zipcode        100 non-null    object 
 14  lat            100 non-null    float64
 15  long           100 non-null    float64
 16  sqft_living15  100 non-null    int64  
 17  sqft_lot15     100 non-null    int64  
dtypes: float64(

In [27]:
test_df.isnull().sum()

bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [28]:

# --------------------------
# 3. Impute missing values (basic strategy: median imputation)
# --------------------------
# (In production you should use the exact imputation strategy/scaler from training.)
for col in model_expected:
    if inference_df[col].isnull().any():
        median_val = inference_df[col].median()
        inference_df[col] = inference_df[col].fillna(median_val)
        print(f"Filled NaNs in {col} with median {median_val}")
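
The comment in the cell above recommends reusing the training-time imputation strategy rather than the medians of the inference batch. A minimal sketch of that idea follows, under the assumption that median imputation is wanted at all; the notebook does not show how the model was trained, and all numbers below are invented.

```python
import pandas as pd

# Invented numbers; only the column name comes from the feature list.
train = pd.DataFrame({"medn_hshld_incm_amt": [40000.0, 60000.0, 80000.0]})
incoming = pd.DataFrame({"medn_hshld_incm_amt": [55000.0, None]})

# Learn the fill values once, from the training data...
train_medians = train.median(numeric_only=True)

# ...then reuse them for every inference batch. fillna with a Series
# matches the Series index against the column names.
filled = incoming.fillna(train_medians)
print(filled)
```

Fill values computed this way do not change with the contents of each request, so the same home gets the same imputed features no matter which batch it arrives in.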

In [29]:

# --------------------------
# 4. Reorder columns to match model input order
# --------------------------
inference_df = inference_df[model_expected]
inference_df


Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,sqft_basement,ppltn_qty,urbn_ppltn_qty,sbrbn_ppltn_qty,...,per_sbrbn,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl
0,4,1.00,1680,5043,1.5,1680,0,40409.0,40409.0,0.0,...,0.0,0.0,0.0,9.0,9.0,17.0,15.0,4.0,11.0,6.0
1,3,2.50,2220,6380,1.5,1660,560,43263.0,43263.0,0.0,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0
2,3,2.25,1630,10962,1.0,1100,530,23926.5,23298.0,0.0,...,0.0,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5
3,5,2.50,1710,9720,2.0,1710,0,17150.0,17150.0,0.0,...,0.0,0.0,0.0,2.0,3.0,10.0,17.0,4.0,26.0,16.0
4,2,1.00,850,6370,1.0,850,0,19435.0,19435.0,0.0,...,0.0,0.0,0.0,4.0,7.0,16.0,19.0,5.0,16.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,2430,0,22271.0,18009.0,0.0,...,0.0,0.0,19.0,1.0,3.0,12.0,17.0,6.0,24.0,11.0
96,2,2.50,1240,1249,3.0,1240,0,18314.0,18314.0,0.0,...,0.0,0.0,0.0,1.0,5.0,14.0,20.0,5.0,28.0,11.0
97,4,1.75,1860,9750,1.0,1460,400,40127.0,40127.0,0.0,...,0.0,0.0,0.0,1.0,4.0,14.0,21.0,6.0,20.0,8.0
98,5,1.75,2330,3800,1.5,1360,970,43263.0,43263.0,0.0,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0


### How reordering works

In [30]:
# --------------------------
# 4. Reorder columns to match model input order
# --------------------------
inference_df_d = test_df.merge(demographics_df, on='zipcode', how='left').copy()
inference_df_a = inference_df_d.copy()
inference_df_b = inference_df_a[['bedrooms','floors','zipcode']]
inference_df_b

Unnamed: 0,bedrooms,floors,zipcode
0,4,1.5,98118
1,3,1.5,98115
2,3,1.0,98030
3,5,2.0,98005
4,2,1.0,98126
...,...,...,...
95,3,2.0,98027
96,2,3.0,98107
97,4,1.0,98034
98,5,1.5,98115


In [31]:
# --------------------------
# 5. Type consistency: ensure all numeric
# --------------------------
#inference_df = inference_df.apply(pd.to_numeric, errors='coerce')

# --------------------------
# 6. Final sanity checks
# --------------------------
print("Final shape:", inference_df.shape)
print("Any NaNs left?", inference_df.isna().any().any())

# Now you can feed inference_df into your trained model:
# preds = model.predict(inference_df)

Final shape: (100, 33)
Any NaNs left? False


In [32]:
inference_df.isna().any().any()

False

In [33]:
predictions = model.predict(inference_df)
predictions.shape

(100,)

In [34]:
predictions

array([ 458520. ,  612800. ,  449160. ,  679700. ,  304256. ,  553798. ,  341800. ,  445350. ,  990500. ,  532940. ,  422700. ,
        484220. ,  499400. ,  358470. ,  790700. ,  236300. ,  426950. ,  687600. ,  619880. ,  438000. ,  520800. ,  669300.2,
        549036. ,  411100. ,  250190. ,  313590. ,  730800. ,  285730. ,  256990. ,  390200. ,  285942.4,  865700. ,  975500. ,
        494936. ,  272090. ,  297900. ,  302298. ,  612000. ,  222590. ,  297940. ,  213800. ,  796988. ,  407260. ,  307300. ,
        451000. ,  263660. ,  297560. ,  658200. ,  261500. ,  288890. , 1241796. ,  279380. ,  252390. ,  252980. ,  569370. ,
        524790. ,  602670. ,  427900. ,  406000. ,  890000. ,  486090. ,  317402. ,  886700. ,  421650. ,  321999. ,  390360. ,
        486980. ,  499000. ,  344200. ,  558650. ,  264590. ,  711190. ,  259930. ,  614000. ,  424089.8,  522800. ,  520300. ,
        412600. ,  830000. ,  258906. ,  726500. ,  565600. ,  220941.6,  404500. ,  412002.8,  795932. 

In [35]:
inference_df_c = inference_df.copy()
inference_df_c.shape

(100, 33)

In [36]:
preds = model.predict(inference_df_c)
inference_df_c["predicted_price"] = preds
inference_df_c.to_csv("predictions.csv", index=False)
print("Predictions written to predictions.csv")

Predictions written to predictions.csv


In [37]:
!ls

predictions.csv  sound_realty.ipynb


In [38]:
sub_df = pd.read_csv('predictions.csv')
sub_df

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,sqft_basement,ppltn_qty,urbn_ppltn_qty,sbrbn_ppltn_qty,...,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl,predicted_price
0,4,1.00,1680,5043,1.5,1680,0,40409.0,40409.0,0.0,...,0.0,0.0,9.0,9.0,17.0,15.0,4.0,11.0,6.0,458520.0
1,3,2.50,2220,6380,1.5,1660,560,43263.0,43263.0,0.0,...,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0,612800.0
2,3,2.25,1630,10962,1.0,1100,530,23926.5,23298.0,0.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,449160.0
3,5,2.50,1710,9720,2.0,1710,0,17150.0,17150.0,0.0,...,0.0,0.0,2.0,3.0,10.0,17.0,4.0,26.0,16.0,679700.0
4,2,1.00,850,6370,1.0,850,0,19435.0,19435.0,0.0,...,0.0,0.0,4.0,7.0,16.0,19.0,5.0,16.0,7.0,304256.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,2430,0,22271.0,18009.0,0.0,...,0.0,19.0,1.0,3.0,12.0,17.0,6.0,24.0,11.0,535800.0
96,2,2.50,1240,1249,3.0,1240,0,18314.0,18314.0,0.0,...,0.0,0.0,1.0,5.0,14.0,20.0,5.0,28.0,11.0,452800.0
97,4,1.75,1860,9750,1.0,1460,400,40127.0,40127.0,0.0,...,0.0,0.0,1.0,4.0,14.0,21.0,6.0,20.0,8.0,471817.0
98,5,1.75,2330,3800,1.5,1360,970,43263.0,43263.0,0.0,...,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0,609388.0


In [39]:
!ls ../data

future_unseen_examples.csv  kc_house_data.csv  zipcode_demographics.csv


In [40]:
#eos

## API 

We start by uploading our data to Modal using Volumes. To quote the Modal documentation:

> Modal Volumes provide a high-performance distributed file system for your Modal applications. They are designed for write-once, read-many I/O workloads, like creating machine learning model weights and distributing them for inference.

Uploading our data lets the training function we run later access the data it needs to train our machine learning model.

In [45]:
!modal setup

Was not able to launch web browser
Please go to this URL manually and complete the flow:

https://modal.com/token-flow/tf-rmhKHkVFA9EUanJtfpJem8

Waiting for authentication in the web browser
Waiting for token flow to complete...
Web authentication finished successfully!
Token is connected to the flexible-functions-ai workspace.
Verifying token against https://api.modal.com
Token verified successfully!
Storing token
Token written to /home/zicofeadmin/.modal.toml in profile flexible-functions-ai.


## Data Upload

In [46]:
!ls ../data

future_unseen_examples.csv  kc_house_data.csv  zipcode_demographics.csv


In [52]:
# Modal app
app = App("sr-hybrid-upload-app")

# Persistent Modal volume
volume = Volume.from_name("sr-hybrid-app-volume", create_if_missing=True)

# Image with boto3 for S3 + optional baked-in dataset + baked-in model artifacts
image = (
    Image.debian_slim()
    .pip_install("boto3")
    .add_local_dir("../data", "/frozen_data")               # dataset (optional frozen)
    .add_local_dir("../model", "/frozen_model")        # model artifacts (optional frozen)
)


@app.function(volumes={"/data": volume}, image=image)
def upload_all(local_dirs: dict = None, s3_bucket: str = None, s3_prefix: str = None, use_frozen: bool = False):
    """
    Uploads training data + model artifacts into the Modal volume.

    - local_dirs: dict of {"remote_subdir": "local_path"} (dev mode)
    - s3_bucket + s3_prefix: fetch from S3 (prod mode)
    - use_frozen: copy pre-baked datasets + models
    """
    os.makedirs("/data", exist_ok=True)

    if local_dirs:
        # Dev mode
        for subdir, local_path in local_dirs.items():
            dest_dir = Path(f"/data/{subdir}")
            dest_dir.mkdir(parents=True, exist_ok=True)

            for file in Path(local_path).glob("*"):
                if file.is_file():
                    shutil.copy(file, dest_dir / file.name)
                    print(f"[DEV] Copied {file} -> {dest_dir / file.name}")

    elif s3_bucket and s3_prefix:
        # Prod mode (fetching from S3)
        s3 = boto3.client("s3")
        result = s3.list_objects_v2(Bucket=s3_bucket, Prefix=s3_prefix)

        for obj in result.get("Contents", []):
            key = obj["Key"]
            filename = os.path.basename(key)
            subdir = os.path.dirname(key).split("/")[-1]  # e.g. "model" or "data"
            dest_dir = Path(f"/data/{subdir}")
            dest_dir.mkdir(parents=True, exist_ok=True)

            dest = dest_dir / filename
            s3.download_file(s3_bucket, key, str(dest))
            print(f"[PROD] Downloaded s3://{s3_bucket}/{key} -> {dest}")

    elif use_frozen:
        # Frozen mode (both datasets + model artifacts baked in)
        for folder, frozen_path in [("data", "/frozen_data"), ("model", "/frozen_model")]:
            dest_dir = Path(f"/data/{folder}")
            dest_dir.mkdir(parents=True, exist_ok=True)

            for file in Path(frozen_path).glob("*"):
                if file.is_file():
                    shutil.copy(file, dest_dir / file.name)
                    print(f"[FROZEN] Copied {file} -> {dest_dir / file.name}")

    else:
        print("⚠️ No source provided. Pass local_dirs, or s3_bucket+s3_prefix, or use_frozen=True.")

    # Confirm
    print("\nFiles now in Modal volume:")
    for file in Path("/data").rglob("*"):
        print(f" - {file}")


# === Notebook/CLI helpers ===

def run_upload_local():
    """Upload local dataset + model artifacts (dev mode)."""
    local_dirs = {
        "data": "../data",                  # training data
        "model": "../model"            # model artifacts (pkl, json, etc.)
    }
    with app.run():
        upload_all.remote(local_dirs=local_dirs)


def run_upload_s3(bucket, prefix):
    """Upload from S3 (prod mode)."""
    with app.run():
        upload_all.remote(s3_bucket=bucket, s3_prefix=prefix)


def run_upload_frozen():
    """Upload frozen dataset + model artifacts (frozen mode)."""
    with app.run():
        upload_all.remote(use_frozen=True)
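
The S3 branch of `upload_all` derives the destination folder from the last directory component of each object key. The mapping can be checked locally without S3 or Modal; `dest_for_key` is a hypothetical helper written for this illustration, and the `sound-realty/` key prefix is made up.

```python
import os
from pathlib import PurePosixPath


def dest_for_key(key, root="/data"):
    """Map an S3 object key to a destination path, mirroring `upload_all`.

    Hypothetical helper for illustration only.
    """
    filename = os.path.basename(key)
    subdir = os.path.dirname(key).split("/")[-1]  # last folder in the key
    return PurePosixPath(root) / subdir / filename


print(dest_for_key("sound-realty/model/model.pkl"))
print(dest_for_key("sound-realty/data/kc_house_data.csv"))
print(dest_for_key("model.pkl"))  # no folder in the key
```

A key without any folder lands directly under the volume root, so with this scheme the bucket layout needs `data/` and `model/` folders for the files to end up in the subdirectories the rest of the notebook expects.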


In [53]:
run_upload_frozen()

In [54]:
!ls ../model/

model.pkl  model_features.json


## Model Serving

In [55]:
class ModalAPITester:
    def __init__(self, base_url=None):
        """
        Initialize the tester with base URL
        You'll need to update these URLs after deployment
        """
        if base_url:
            # base_url is the URL prefix up to the app name, without the endpoint suffix
            self.health_url = f"{base_url}-health.modal.run"
            self.predict_url = f"{base_url}-predict-csv.modal.run"
        else:
            # Replace these with your actual deployed URLs
            self.health_url = "https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run"
            self.predict_url = "https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run"
    
    def test_health_endpoint(self):
        """Test the health check endpoint"""
        print("=" * 50)
        print("Testing Health Endpoint")
        print("=" * 50)
        
        try:
            response = requests.get(self.health_url, timeout=10)
            
            print(f"Status Code: {response.status_code}")
            print(f"Response: {response.json()}")
            
            if response.status_code == 200:
                print("✅ Health endpoint is working!")
                return True
            else:
                print("❌ Health endpoint failed!")
                return False
                
        except requests.exceptions.RequestException as e:
            print(f"❌ Error connecting to health endpoint: {str(e)}")
            print("Make sure your Modal app is deployed and the URL is correct")
            return False
    
    def test_predict_endpoint(self, csv_file_path=None):
        """Test the CSV prediction endpoint"""
        print("\n" + "=" * 50)
        print("Testing Prediction Endpoint")
        print("=" * 50)
        
        # Default test data
        if csv_file_path is None:
            csv_file_path = self.create_test_csv()
        
        try:
            # Check if file exists
            if not Path(csv_file_path).exists():
                print(f"❌ Test file not found: {csv_file_path}")
                return False
            
            print(f"Using test file: {csv_file_path}")
            
            # Read and display file info
            test_df = pd.read_csv(csv_file_path)
            print(f"Test data shape: {test_df.shape}")
            print(f"Test data columns: {test_df.columns.tolist()}")
            print(f"First few rows:\n{test_df.head()}")
            
            # Prepare the file for upload
            with open(csv_file_path, 'rb') as f:
                files = {'file': ('test_data.csv', f, 'text/csv')}
                
                print(f"\nSending request to: {self.predict_url}")
                print("This might take a moment...")
                
                # Make the request
                response = requests.post(
                    self.predict_url, 
                    files=files,
                    timeout=60  # Increase timeout for model inference
                )
            
            print(f"Status Code: {response.status_code}")
            
            if response.status_code == 200:
                try:
                    result = response.json()
                    
                    if isinstance(result, list):
                        # Direct predictions
                        predictions = result
                        print("✅ Predictions received!")
                        print(f"Number of predictions: {len(predictions)}")
                        print(f"First 5 predictions: {predictions[:5]}")
                        if predictions:  # min()/max() raise ValueError on an empty list
                            print(f"Prediction range: {min(predictions):.2f} to {max(predictions):.2f}")
                        return True
                        
                    elif isinstance(result, dict) and result.get('success') is False:
                        # Error response
                        print(f"❌ API returned error: {result.get('error')}")
                        if 'traceback' in result:
                            print(f"Traceback: {result['traceback']}")
                        return False
                        
                    else:
                        # Structured response
                        print(f"Response: {result}")
                        return True
                        
                except ValueError:  # JSON decode errors from response.json() subclass ValueError
                    print(f"❌ Could not parse JSON response: {response.text}")
                    return False
            else:
                print(f"❌ Request failed with status {response.status_code}")
                print(f"Response: {response.text}")
                return False
                
        except requests.exceptions.Timeout:
            print("❌ Request timed out. The model might be taking too long to respond.")
            return False
        except requests.exceptions.RequestException as e:
            print(f"❌ Error making request: {e}")
            return False
    
    def create_test_csv(self):
        """Create a simple test CSV if none provided"""
        test_data = {
            'bedrooms': [3, 4, 2, 5],
            'bathrooms': [2.0, 3.0, 1.5, 2.5],
            'sqft_living': [1500, 2000, 1200, 2500],
            'sqft_lot': [5000, 6000, 4000, 7000],
            'floors': [1, 2, 1, 2],
            'zipcode': [98001, 98002, 98003, 98004]  # Assuming these exist in demographics
        }
        
        test_df = pd.DataFrame(test_data)
        test_file = 'test_predictions.csv'
        test_df.to_csv(test_file, index=False)
        
        print(f"Created test CSV: {test_file}")
        return test_file
    
    def run_full_test(self, csv_file_path=None):
        """Run complete test suite"""
        print("🚀 Starting Modal API Tests")
        print(f"Health URL: {self.health_url}")
        print(f"Predict URL: {self.predict_url}")
        
        # Test 1: Health check
        health_passed = self.test_health_endpoint()
        
        if not health_passed:
            print("\n❌ Health check failed. Skipping prediction test.")
            return False
        
        # Test 2: Prediction endpoint
        predict_passed = self.test_predict_endpoint(csv_file_path)
        
        # Summary
        print("\n" + "=" * 50)
        print("TEST SUMMARY")
        print("=" * 50)
        print(f"Health Endpoint: {'✅ PASS' if health_passed else '❌ FAIL'}")
        print(f"Predict Endpoint: {'✅ PASS' if predict_passed else '❌ FAIL'}")

        if health_passed and predict_passed:
            print("\n🎉 All tests passed! Your API is working correctly.")
            return True
        else:
            print("\n😞 Some tests failed. Check the errors above.")
            return False

# Usage examples:
def test_with_default_data():
    """Test with automatically generated data"""
    tester = ModalAPITester()
    # Update URLs after deployment
    tester.health_url = "https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run"
    tester.predict_url = "https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run"
    
    return tester.run_full_test()

def test_with_your_data():
    """Test with your actual test data"""
    tester = ModalAPITester()
    # Update URLs after deployment
    tester.health_url = "https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run"
    tester.predict_url = "https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run"
    
    # Use your actual test file
    csv_file = "../data/future_unseen_examples.csv"  # Adjust path as needed
    return tester.run_full_test(csv_file)

def quick_test():
    """Quick test function for notebook use"""
    # Replace these URLs with your actual deployment URLs
    health_url = "https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run/"
    predict_url = "https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run"
    
    tester = ModalAPITester()
    tester.health_url = health_url
    tester.predict_url = predict_url
    
    return tester.run_full_test()

if __name__ == "__main__":
    # Run the test
    print("Modal API Tester")
    print("Remember to update the URLs with your actual deployed endpoints!")
    
    # Create tester instance
    tester = ModalAPITester()
    
    # You MUST update these URLs after deployment
    print("\n⚠️  IMPORTANT: Update these URLs with your actual deployment URLs:")
    print(f"Current health URL: {tester.health_url}")
    print(f"Current predict URL: {tester.predict_url}")
    
    # Uncomment to run with default test data:
    # test_with_default_data()
    
    # Uncomment to run with your actual data:
    # test_with_your_data()

Modal API Tester
Remember to update the URLs with your actual deployed endpoints!

⚠️  IMPORTANT: Update these URLs with your actual deployment URLs:
Current health URL: https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run
Current predict URL: {https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run


In [58]:
test_with_default_data()

🚀 Starting Modal API Tests
Health URL: https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run/
Predict URL: https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run
Testing Health Endpoint
❌ Error connecting to health endpoint: HTTPSConnectionPool(host='flexible-functions-ai--sr-hybrid-sales-api-health.modal.run', port=443): Read timed out. (read timeout=10)
Make sure your Modal app is deployed and the URL is correct

❌ Health check failed. Skipping prediction test.


False

In [60]:
test_with_your_data()

🚀 Starting Modal API Tests
Health URL: https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run/
Predict URL: https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run
Testing Health Endpoint
❌ Error connecting to health endpoint: HTTPSConnectionPool(host='flexible-functions-ai--sr-hybrid-sales-api-health.modal.run', port=443): Read timed out. (read timeout=10)
Make sure your Modal app is deployed and the URL is correct

❌ Health check failed. Skipping prediction test.


False
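
Both runs above stop at the health check with a read timeout after 10 seconds. One plausible cause (an assumption, the logs do not confirm it) is that the endpoint needed longer than 10 seconds to start responding. A minimal sketch of a retry helper with exponential backoff is shown below. The name `call_with_retries` and its parameters are hypothetical and not part of `ModalAPITester`; the helper uses only the standard library so it can be tried without a deployed endpoint.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds or the attempts run out.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between tries and re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)  # injectable so the helper can be tested without waiting
```

In the tester, the health request could then be wrapped as `call_with_retries(lambda: requests.get(tester.health_url, timeout=30), retry_on=(requests.exceptions.Timeout,))`. The 30-second timeout is an illustrative guess rather than a measured value, and retrying does not help if the app is not deployed or the URL is wrong.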