# Predicting real estate prices - Model serving and deployment

## Background
Sound Realty helps people sell homes in the Seattle area.

They currently spend too much time and effort on estimating the value of properties.

One of their staff has heard a lot about machine learning (ML) and has created a basic model to estimate the value of properties.

The basic model uses only numeric variables and ignores some other attributes.
Despite the simplicity of this model, the folks at Sound are impressed with the proof of concept and would now like to use this model to streamline
their business.

They have contracted us to help deploy that model for broader use.
Our job is to create a REST endpoint that serves up model predictions for new data, and to provide guidance on how they could improve the model.

## Proposed Solution

Here I shall deploy the model to a REST endpoint using Modal.


## Library Installation and Import

Below I shall install then import the libraries needed.

In [2]:
!pip install uv

Collecting uv
  Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.19-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.9 MB)
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m20.9/20.9 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hInstalling collected packages: uv
Successfully installed uv-0.8.19


In [80]:
#%pip install seaborn tqdm sweetviz dash streamlit plotly requests gradio joblib scikit-learn ipywidgets modal bentoml wandb

In [1]:
%uv pip install seaborn tqdm sweetviz dash streamlit plotly requests gradio
%uv pip install joblib scikit-learn ipywidgets modal bentoml wandb

[2mUsing Python 3.10.12 environment at: /usr[0m
[2K[2mResolved [1m96 packages[0m [2min 2.68s[0m[0m                                        [0m
[2K[2mPrepared [1m6 packages[0m [2min 24.58s[0m[0m                                            
[1m[31merror[39m[0m: Failed to install: pyparsing-3.2.5-py3-none-any.whl (pyparsing==3.2.5)
  [1m[31mCaused by[39m[0m: failed to create directory `/usr/local/lib/python3.10/dist-packages/pyparsing-3.2.5.dist-info`: Permission denied (os error 13)
Note: you may need to restart the kernel to use updated packages.
[2mUsing Python 3.10.12 environment at: /usr[0m
[2K[2mResolved [1m117 packages[0m [2min 2.09s[0m[0m                                       [0m
[2K[2mPrepared [1m5 packages[0m [2min 3.45s[0m[0m                                             
[1m[31merror[39m[0m: Failed to install: threadpoolctl-3.6.0-py3-none-any.whl (threadpoolctl==3.6.0)
  [1m[31mCaused by[39m[0m: failed to create directory `/usr/loc

### Imports

In [1]:
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
from numpy import random
from tqdm import tqdm
from ipywidgets import interact
from pathlib import Path
import os, warnings, io, getpass, json, dash, modal, bentoml, gc, wandb, pickle
from joblib import dump, load
from dash import dcc, html, dash_table
import typing as t
from bentoml.validators import DataframeSchema
from fastapi import File, UploadFile, Form, HTTPException
import io
np.set_printoptions(linewidth=130)
plt.rc('image', cmap='Greys')
import sys

## Exploratory Data Analysis

We have 3 datasets namely
- **kc_house_data.csv** ‚Äì Data for training the model
- **zipcode_demographics.csv** ‚Äì Additional demographic data from the U.S. Census which are used as features. This data should be joined to the primary home sales using the zipcode column.
- **future_unseen_examples.csv** ‚Äì This file contains examples of homes to be sold in the future. It includes all attributes from the original home sales file, but not the price , date , or id . It also does not include the demographic data.


Lets first take a look at our dataset

In [2]:
path = Path('..')
path

PosixPath('..')

In [3]:
!ls ../data

future_unseen_examples.csv  kc_house_data.csv  zipcode_demographics.csv


In [5]:
train_df = pd.read_csv(path/'data/kc_house_data.csv', index_col='id')
demographics_df = pd.read_csv(path/'data/zipcode_demographics.csv')
test_df = pd.read_csv(path/'data/future_unseen_examples.csv')

In [6]:
train_df

Unnamed: 0_level_0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,3,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,3,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,3,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,3,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


In [12]:
#train_df??

In [8]:
train_df.columns

Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [15]:
test_df.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [16]:
demographics_df.columns

Index(['ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty',
       'non_farm_qty', 'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt',
       'hous_val_amt', 'edctn_less_than_9_qty', 'edctn_9_12_qty',
       'edctn_high_schl_qty', 'edctn_some_clg_qty', 'edctn_assoc_dgre_qty',
       'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn',
       'per_farm', 'per_non_farm', 'per_less_than_9', 'per_9_to_12', 'per_hsd',
       'per_some_clg', 'per_assoc', 'per_bchlr', 'per_prfsnl', 'zipcode'],
      dtype='object')

In [17]:
!ls ../model

model.pkl  model_features.json


In [18]:
import json
from pathlib import Path

# Check different possible locations
possible_paths = [
    Path("model_features.json"),  # current directory
    Path("model/model_features.json"),  # model subdirectory
    Path("../model/model_features.json"),  # parent directory
    Path("data/model_features.json"),  # data directory if you have one
]

for path in possible_paths:
    if path.exists():
        print(f"Found features file at: {path}")
        model_features = json.loads(path.read_text())
        print("Model features:", model_features)
        break
else:
    print("model_features.json not found in any expected location")

Found features file at: ../model/model_features.json
Model features: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement', 'ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty', 'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt', 'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty', 'edctn_some_clg_qty', 'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm', 'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg', 'per_assoc', 'per_bchlr', 'per_prfsnl']


In [37]:
len(train_df.columns),len(test_df.columns)

(20, 18)

In [23]:
demographics_df

Unnamed: 0,ppltn_qty,urbn_ppltn_qty,sbrbn_ppltn_qty,farm_ppltn_qty,non_farm_qty,medn_hshld_incm_amt,medn_incm_per_prsn_amt,hous_val_amt,edctn_less_than_9_qty,edctn_9_12_qty,...,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl,zipcode
0,38249.0,37394.0,0.0,0.0,855.0,66051.0,25219.0,192000.0,437.0,2301.0,...,0.0,2.0,1.0,6.0,18.0,20.0,5.0,12.0,4.0,98042
1,22036.0,22036.0,0.0,0.0,0.0,91904.0,53799.0,573900.0,149.0,404.0,...,0.0,0.0,0.0,1.0,6.0,12.0,3.0,27.0,22.0,98040
2,18194.0,18194.0,0.0,0.0,0.0,61813.0,31765.0,246600.0,269.0,905.0,...,0.0,0.0,1.0,4.0,13.0,20.0,6.0,19.0,9.0,98028
3,21956.0,21956.0,0.0,0.0,0.0,47461.0,22158.0,175400.0,925.0,1773.0,...,0.0,0.0,4.0,8.0,20.0,21.0,5.0,12.0,4.0,98178
4,22814.0,22814.0,0.0,0.0,0.0,48606.0,28398.0,252600.0,599.0,1148.0,...,0.0,0.0,2.0,5.0,13.0,17.0,5.0,23.0,12.0,98007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,35140.0,35021.0,0.0,0.0,119.0,81929.0,41856.0,335900.0,212.0,865.0,...,0.0,0.0,0.0,2.0,8.0,15.0,4.0,27.0,15.0,98006
66,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98074
67,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98077
68,23926.5,23298.0,0.0,0.0,0.0,56933.0,27639.5,239850.0,406.0,1213.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,98030


In [24]:
test_df

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,4,1.00,1680,5043,1.5,0,0,4,6,1680,0,1911,0,98118,47.5354,-122.273,1560,5765
1,3,2.50,2220,6380,1.5,0,0,4,8,1660,560,1931,0,98115,47.6974,-122.313,950,6380
2,3,2.25,1630,10962,1.0,0,0,4,8,1100,530,1977,0,98030,47.3801,-122.166,1830,8470
3,5,2.50,1710,9720,2.0,0,0,4,8,1710,0,1974,0,98005,47.5903,-122.157,2270,9672
4,2,1.00,850,6370,1.0,0,0,3,6,850,0,1951,0,98126,47.5198,-122.373,850,5170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,0,0,3,10,2430,0,1987,0,98027,47.4664,-121.992,2910,49658
96,2,2.50,1240,1249,3.0,0,0,3,8,1240,0,2006,0,98107,47.6718,-122.386,1240,2500
97,4,1.75,1860,9750,1.0,0,0,3,7,1460,400,1969,0,98034,47.7097,-122.202,1900,8913
98,5,1.75,2330,3800,1.5,0,0,3,7,1360,970,1927,0,98115,47.6835,-122.308,2100,3800


In [26]:
!ls model

model.pkl  model_features.json


In [13]:
model_path = path/'model/model.pkl'

In [7]:
def load_model(model_path="../model/model.pkl"):
    """
    Function to load the model

    Args
    model_path: the path to the model
    """
    if os.path.exists(model_path):
        print(f"Loading existing model from pickle at {model_path}")
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        return model
    else:
        print(f"Model file not found at {model_path}")
        return None

In [22]:
load_model

<function __main__.load_model(model_path='model/model.pkl')>

In [23]:
load_model??

[31mSignature:[39m load_model(model_path=[33m'model/model.pkl'[39m)
[31mSource:[39m   
[38;5;28;01mdef[39;00m load_model(model_path=[33m"model/model.pkl"[39m):
    [33m"""[39m
[33m    Function to load the model[39m

[33m    Args[39m
[33m    model_path: the path to the model[39m
[33m    """[39m
    [38;5;28;01mif[39;00m os.path.exists(model_path):
        print(f"Loading existing model from pickle at {model_path}")
        [38;5;28;01mwith[39;00m open(model_path, [33m'rb'[39m) [38;5;28;01mas[39;00m f:
            model = pickle.load(f)
        [38;5;28;01mreturn[39;00m model
    [38;5;28;01melse[39;00m:
        print(f"Model file not found at {model_path}")
        [38;5;28;01mreturn[39;00m [38;5;28;01mNone[39;00m
[31mFile:[39m      /tmp/ipykernel_100471/1537687095.py
[31mType:[39m      function

In [8]:
model = load_model()
model

Loading existing model from pickle at ../model/model.pkl


0,1,2
,steps,"[('robustscaler', ...), ('kneighborsregressor', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,with_centering,True
,with_scaling,True
,quantile_range,"(25.0, ...)"
,copy,True
,unit_variance,False

0,1,2
,n_neighbors,5
,weights,'uniform'
,algorithm,'auto'
,leaf_size,30
,p,2
,metric,'minkowski'
,metric_params,
,n_jobs,


In [41]:
#model.predict(test_df)

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- condition
- grade
- lat
- long
- sqft_living15
- ...
Feature names seen at fit time, yet now missing:
- edctn_9_12_qty
- edctn_assoc_dgre_qty
- edctn_bchlr_dgre_qty
- edctn_high_schl_qty
- edctn_less_than_9_qty
- ...


## Trial solution 1 - chatgpt

In [17]:
import pandas as pd
import numpy as np

# --------------------------
# Example: model_expected list
# --------------------------
model_expected = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
    'sqft_above', 'sqft_basement',
    'ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty',
    'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt',
    'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty', 'edctn_some_clg_qty',
    'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty',
    'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm',
    'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg',
    'per_assoc', 'per_bchlr', 'per_prfsnl'
]

# --------------------------
# 1. Merge housing with demographics
# --------------------------
# Ensure zipcodes are the same dtype (string recommended to preserve leading zeros)
test_df['zipcode'] = test_df['zipcode'].astype(str)
demographics_df['zipcode'] = demographics_df['zipcode'].astype(str)

# Merge on zipcode
inference_df = test_df.merge(demographics_df, on='zipcode', how='left')
inference_df

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,...,per_sbrbn,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl
0,4,1.00,1680,5043,1.5,0,0,4,6,1680,...,0.0,0.0,0.0,9.0,9.0,17.0,15.0,4.0,11.0,6.0
1,3,2.50,2220,6380,1.5,0,0,4,8,1660,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0
2,3,2.25,1630,10962,1.0,0,0,4,8,1100,...,0.0,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5
3,5,2.50,1710,9720,2.0,0,0,4,8,1710,...,0.0,0.0,0.0,2.0,3.0,10.0,17.0,4.0,26.0,16.0
4,2,1.00,850,6370,1.0,0,0,3,6,850,...,0.0,0.0,0.0,4.0,7.0,16.0,19.0,5.0,16.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,0,0,3,10,2430,...,0.0,0.0,19.0,1.0,3.0,12.0,17.0,6.0,24.0,11.0
96,2,2.50,1240,1249,3.0,0,0,3,8,1240,...,0.0,0.0,0.0,1.0,5.0,14.0,20.0,5.0,28.0,11.0
97,4,1.75,1860,9750,1.0,0,0,3,7,1460,...,0.0,0.0,0.0,1.0,4.0,14.0,21.0,6.0,20.0,8.0
98,5,1.75,2330,3800,1.5,0,0,3,7,1360,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0


In [18]:
test_df.shape,demographics_df.shape,inference_df.shape

((100, 18), (70, 27), (100, 44))

In [19]:
inference_df.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15', 'ppltn_qty', 'urbn_ppltn_qty',
       'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty',
       'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt',
       'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty',
       'edctn_some_clg_qty', 'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty',
       'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm',
       'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg',
       'per_assoc', 'per_bchlr', 'per_prfsnl'],
      dtype='object')

In [20]:

# --------------------------
# 2. Check coverage of expected features
# --------------------------
missing_features = set(model_expected) - set(inference_df.columns)
if missing_features:
    print("WARNING: These expected features are missing after merge:", missing_features)
else:
    print("All expected features are present after merge.")


All expected features are present after merge.


In [21]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   bedrooms       100 non-null    int64  
 1   bathrooms      100 non-null    float64
 2   sqft_living    100 non-null    int64  
 3   sqft_lot       100 non-null    int64  
 4   floors         100 non-null    float64
 5   waterfront     100 non-null    int64  
 6   view           100 non-null    int64  
 7   condition      100 non-null    int64  
 8   grade          100 non-null    int64  
 9   sqft_above     100 non-null    int64  
 10  sqft_basement  100 non-null    int64  
 11  yr_built       100 non-null    int64  
 12  yr_renovated   100 non-null    int64  
 13  zipcode        100 non-null    object 
 14  lat            100 non-null    float64
 15  long           100 non-null    float64
 16  sqft_living15  100 non-null    int64  
 17  sqft_lot15     100 non-null    int64  
dtypes: float64(

In [22]:
test_df.isnull().sum()

bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [23]:

# --------------------------
# 3. Impute missing values (basic strategy: median imputation)
# --------------------------
# (In production you should use the exact imputation strategy/scaler from training.)
for col in model_expected:
    if inference_df[col].isnull().any():
        median_val = inference_df[col].median()
        inference_df[col] = inference_df[col].fillna(median_val)
        print(f"Filled NaNs in {col} with median {median_val}")

In [24]:

# --------------------------
# 4. Reorder columns to match model input order
# --------------------------
inference_df = inference_df[model_expected]
inference_df


Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,sqft_basement,ppltn_qty,urbn_ppltn_qty,sbrbn_ppltn_qty,...,per_sbrbn,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl
0,4,1.00,1680,5043,1.5,1680,0,40409.0,40409.0,0.0,...,0.0,0.0,0.0,9.0,9.0,17.0,15.0,4.0,11.0,6.0
1,3,2.50,2220,6380,1.5,1660,560,43263.0,43263.0,0.0,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0
2,3,2.25,1630,10962,1.0,1100,530,23926.5,23298.0,0.0,...,0.0,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5
3,5,2.50,1710,9720,2.0,1710,0,17150.0,17150.0,0.0,...,0.0,0.0,0.0,2.0,3.0,10.0,17.0,4.0,26.0,16.0
4,2,1.00,850,6370,1.0,850,0,19435.0,19435.0,0.0,...,0.0,0.0,0.0,4.0,7.0,16.0,19.0,5.0,16.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,2430,0,22271.0,18009.0,0.0,...,0.0,0.0,19.0,1.0,3.0,12.0,17.0,6.0,24.0,11.0
96,2,2.50,1240,1249,3.0,1240,0,18314.0,18314.0,0.0,...,0.0,0.0,0.0,1.0,5.0,14.0,20.0,5.0,28.0,11.0
97,4,1.75,1860,9750,1.0,1460,400,40127.0,40127.0,0.0,...,0.0,0.0,0.0,1.0,4.0,14.0,21.0,6.0,20.0,8.0
98,5,1.75,2330,3800,1.5,1360,970,43263.0,43263.0,0.0,...,0.0,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0


### How reodering works

In [25]:
# --------------------------
# 4. Reorder columns to match model input order
# --------------------------
inference_df_d = test_df.merge(demographics_df, on='zipcode', how='left').copy()
inference_df_a = inference_df_d.copy()
inference_df_b = inference_df_a[['bedrooms','floors','zipcode']]
inference_df_b

Unnamed: 0,bedrooms,floors,zipcode
0,4,1.5,98118
1,3,1.5,98115
2,3,1.0,98030
3,5,2.0,98005
4,2,1.0,98126
...,...,...,...
95,3,2.0,98027
96,2,3.0,98107
97,4,1.0,98034
98,5,1.5,98115


In [26]:
# --------------------------
# 5. Type consistency: ensure all numeric
# --------------------------
#inference_df = inference_df.apply(pd.to_numeric, errors='coerce')

# --------------------------
# 6. Final sanity checks
# --------------------------
print("Final shape:", inference_df.shape)
print("Any NaNs left?", inference_df.isna().any().any())

# Now you can feed inference_df into your trained model:
# preds = model.predict(inference_df)

Final shape: (100, 33)
Any NaNs left? False


In [27]:
inference_df.isna().any().any()

False

In [28]:
predictions = model.predict(inference_df)
predictions.shape

(100,)

In [29]:
predictions

array([ 458520. ,  612800. ,  449160. ,  679700. ,  304256. ,  553798. ,  341800. ,  445350. ,  990500. ,  532940. ,  422700. ,
        484220. ,  499400. ,  358470. ,  790700. ,  236300. ,  426950. ,  687600. ,  619880. ,  438000. ,  520800. ,  669300.2,
        549036. ,  411100. ,  250190. ,  313590. ,  730800. ,  285730. ,  256990. ,  390200. ,  285942.4,  865700. ,  975500. ,
        494936. ,  272090. ,  297900. ,  302298. ,  612000. ,  222590. ,  297940. ,  213800. ,  796988. ,  407260. ,  307300. ,
        451000. ,  263660. ,  297560. ,  658200. ,  261500. ,  288890. , 1241796. ,  279380. ,  252390. ,  252980. ,  569370. ,
        524790. ,  602670. ,  427900. ,  406000. ,  890000. ,  486090. ,  317402. ,  886700. ,  421650. ,  321999. ,  390360. ,
        486980. ,  499000. ,  344200. ,  558650. ,  264590. ,  711190. ,  259930. ,  614000. ,  424089.8,  522800. ,  520300. ,
        412600. ,  830000. ,  258906. ,  726500. ,  565600. ,  220941.6,  404500. ,  412002.8,  795932. 

In [30]:
inference_df_c = inference_df.copy()
inference_df_c.shape

(100, 33)

In [31]:
preds = model.predict(inference_df_c)
inference_df_c["predicted_price"] = preds
inference_df_c.to_csv("predictions.csv", index=False)
print("Predictions written to predictions.csv")

Predictions written to predictions.csv


In [9]:
!ls

predictions.csv  sound_realty.ipynb


In [10]:
sub_df = pd.read_csv('predictions.csv')
sub_df

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,sqft_basement,ppltn_qty,urbn_ppltn_qty,sbrbn_ppltn_qty,...,per_farm,per_non_farm,per_less_than_9,per_9_to_12,per_hsd,per_some_clg,per_assoc,per_bchlr,per_prfsnl,predicted_price
0,4,1.00,1680,5043,1.5,1680,0,40409.0,40409.0,0.0,...,0.0,0.0,9.0,9.0,17.0,15.0,4.0,11.0,6.0,458520.0
1,3,2.50,2220,6380,1.5,1660,560,43263.0,43263.0,0.0,...,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0,612800.0
2,3,2.25,1630,10962,1.0,1100,530,23926.5,23298.0,0.0,...,0.0,0.0,1.0,5.0,15.0,19.0,5.0,19.0,7.5,449160.0
3,5,2.50,1710,9720,2.0,1710,0,17150.0,17150.0,0.0,...,0.0,0.0,2.0,3.0,10.0,17.0,4.0,26.0,16.0,679700.0
4,2,1.00,850,6370,1.0,850,0,19435.0,19435.0,0.0,...,0.0,0.0,4.0,7.0,16.0,19.0,5.0,16.0,7.0,304256.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3,2.50,2430,54059,2.0,2430,0,22271.0,18009.0,0.0,...,0.0,19.0,1.0,3.0,12.0,17.0,6.0,24.0,11.0,535800.0
96,2,2.50,1240,1249,3.0,1240,0,18314.0,18314.0,0.0,...,0.0,0.0,1.0,5.0,14.0,20.0,5.0,28.0,11.0,452800.0
97,4,1.75,1860,9750,1.0,1460,400,40127.0,40127.0,0.0,...,0.0,0.0,1.0,4.0,14.0,21.0,6.0,20.0,8.0,471817.0
98,5,1.75,2330,3800,1.5,1360,970,43263.0,43263.0,0.0,...,0.0,0.0,0.0,2.0,8.0,15.0,4.0,30.0,20.0,609388.0


In [35]:
!ls ../data

future_unseen_examples.csv  kc_house_data.csv  zipcode_demographics.csv


In [None]:
#eos

## API 

We are now going to start by uploading our data to modal using volumes. To quote the modal documentation

Modal Volumes provide a high-performance distributed file system for your modal applications. They are designed for write-once, read-many I/O workloads, like creating machine learning model weights and distributing them for inference.

Uploading our data will enable our training function that we run later to access the data it will need to train our machine learning model.

In [11]:
!modal setup

[2K[31mWas not able to launch web browser[0me web browserer
Please go to this URL manually and complete the flow:

[2K]8;id=273796;https://modal.com/token-flow/tf-VaMaCbvM7QGYy1Bn0EtX9U\[4;94mhttps://modal.com/token-flow/tf-VaMaCbvM7QGYy1Bn0EtX9U[0m]8;;\

[2K[32m‚†¶[0m Waiting for authentication in the web browser
[2K[32m‚†π[0m Waiting for token flow to complete...omplete...
[1A[2K[32mWeb authentication finished successfully![0m
[32mToken is connected to the [0m[35mflexible-functions-ai[0m[32m workspace.[0m
Verifying token against [4;34mhttps://api.modal.com[0m
[32mToken verified successfully![0m
[?25l[32m‚†ã[0m Storing token
[1A[2K[32mToken written to [0m[35m/home/zicofeadmin/[0m[35m.modal.toml[0m[32m in profile [0m[35mflexible-functions-ai[0m[32m.[0m


## Data Upload

In [56]:
!ls ../data

future_unseen_examples.csv  kc_house_data.csv  zipcode_demographics.csv


In [47]:
!python ../data_upload.py ../data

In [None]:
python data_upload.py ./data

In [65]:
from modal import App, Volume, Mount
from pathlib import Path
import shutil, os

app = App("sr-data-upload")

volume = Volume.from_name("sr-data-volume", create_if_missing=True)

# Mount local ../data into the container at /source_data
local_mount = Mount.from_local_dir("../data", remote_path="/source_data")

@app.function(volumes={"/data": volume}, mounts=[local_mount])
def upload_data():
    os.makedirs("/data", exist_ok=True)

    for file in Path("/source_data").glob("*"):
        dest = f"/data/{file.name}"
        if file.is_file():
            shutil.copy(file, dest)
            print(f"Copied {file} to {dest}")

    print("\nFiles in Modal volume:")
    for file in Path("/data").glob("*"):
        print(f" - {file}")

with app.run():
    upload_data.remote()


/tmp/ipykernel_286169/1567640505.py:10: DeprecationError: 2025-01-08: modal.Mount usage will soon be deprecated.

Use image.add_local_dir instead, which is functionally and performance-wise equivalent.

See https://modal.com/docs/guide/modal-1-0-migration for more details.

  local_mount = Mount.from_local_dir("../data", remote_path="/source_data")


In [66]:
from modal import App, Volume, Mount
from pathlib import Path
import shutil, os

# Define Modal app
app = App("sr2-data-upload")

# Persistent Modal volume
volume = Volume.from_name("sr2-data-volume", create_if_missing=True)

# Mount local ../data into the container at /source_data
local_mount = Mount.from_local_dir("../data", remote_path="/source_data")

# Function that runs inside Modal container
@app.function(volumes={"/data": volume}, mounts=[local_mount])
def upload_data():
    os.makedirs("/data", exist_ok=True)

    # Copy files from mounted dir into the volume
    for file in Path("/source_data").glob("*"):
        dest = f"/data/{file.name}"
        if file.is_file():
            shutil.copy(file, dest)
            print(f"Copied {file} to {dest}")

    # Confirm files in the volume
    print("\nFiles in Modal volume:")
    for file in Path("/data").glob("*"):
        print(f" - {file}")

# Run the app inside the notebook
with app.run():
    upload_data.remote()


/tmp/ipykernel_286169/3241485202.py:12: DeprecationError: 2025-01-08: modal.Mount usage will soon be deprecated.

Use image.add_local_dir instead, which is functionally and performance-wise equivalent.

See https://modal.com/docs/guide/modal-1-0-migration for more details.

  local_mount = Mount.from_local_dir("../data", remote_path="/source_data")


In [68]:
from modal import App, Volume, Image
from pathlib import Path
import shutil, os

# Define Modal app
app = App("sr3-data-upload")

# Persistent Modal volume
volume = Volume.from_name("sr3-data-volume", create_if_missing=True)

# Define an image with your local ../data added at /source_data
image = Image.debian_slim().add_local_dir("../data", "/source_data")

# Function that runs inside Modal container
@app.function(volumes={"/data": volume}, image=image)
def upload_data():
    os.makedirs("/data", exist_ok=True)

    # Copy files from baked-in /source_data into the persistent volume
    for file in Path("/source_data").glob("*"):
        dest = f"/data/{file.name}"
        if file.is_file():
            shutil.copy(file, dest)
            print(f"Copied {file} to {dest}")

    # Confirm files in the volume
    print("\nFiles in Modal volume:")
    for file in Path("/data").glob("*"):
        print(f" - {file}")

# Run the app inside the notebook
with app.run():
    upload_data.remote()

In [12]:
from modal import App, Volume, Image
from pathlib import Path
import boto3, shutil, os

# Modal app
app = App("sr-hybrid-upload")

# Persistent Modal volume
volume = Volume.from_name("sr-hybrid-volume", create_if_missing=True)

# Image with boto3 for S3 + optional baked-in dataset
image = (
    Image.debian_slim()
    .pip_install("boto3")
    # üëá optional: freeze a dataset into the container at build time
    .add_local_dir("../data", "/frozen_data")
)


@app.function(volumes={"/data": volume}, image=image)
def upload_data(local_files: dict = None, s3_bucket: str = None, s3_prefix: str = None, use_frozen: bool = False):
    """
    Uploads data into the Modal volume.
    - local_files: development mode, uploads directly from host
    - s3_bucket + s3_prefix: production mode, pulls from S3
    - use_frozen=True: frozen dataset baked into the container image
    """
    os.makedirs("/data", exist_ok=True)

    if local_files:
        # Dev mode
        for name, content in local_files.items():
            dest = f"/data/{name}"
            with open(dest, "wb") as f:
                f.write(content)
            print(f"[DEV] Copied {name} -> {dest}")

    elif s3_bucket and s3_prefix:
        # Prod mode
        s3 = boto3.client("s3")
        result = s3.list_objects_v2(Bucket=s3_bucket, Prefix=s3_prefix)

        for obj in result.get("Contents", []):
            key = obj["Key"]
            filename = os.path.basename(key)
            dest = f"/data/{filename}"

            s3.download_file(s3_bucket, key, dest)
            print(f"[PROD] Downloaded s3://{s3_bucket}/{key} -> {dest}")

    elif use_frozen:
        # Frozen dataset baked in at build time
        for file in Path("/frozen_data").glob("*"):
            dest = f"/data/{file.name}"
            if file.is_file():
                shutil.copy(file, dest)
                print(f"[FROZEN] Copied {file} -> {dest}")

    else:
        print("‚ö†Ô∏è No data source provided. Pass local_files, s3_bucket+s3_prefix, or use_frozen=True.")

    # Confirm files
    print("\nFiles in Modal volume:")
    for file in Path("/data").glob("*"):
        print(f" - {file}")


# === Notebook/CLI Helpers ===

def run_upload_local(path="../data"):
    """Upload local files (dev mode)."""
    local_files = {}
    for file in Path(path).glob("*"):
        if file.is_file():
            local_files[file.name] = file.read_bytes()

    with app.run():
        upload_data.remote(local_files=local_files)


def run_upload_s3(bucket, prefix):
    """Upload from S3 (prod mode)."""
    with app.run():
        upload_data.remote(s3_bucket=bucket, s3_prefix=prefix)


def run_upload_frozen():
    """Upload baked-in dataset (frozen mode)."""
    with app.run():
        upload_data.remote(use_frozen=True)

In [13]:
run_upload_frozen()

In [14]:
from fastapi import File, UploadFile, Form, HTTPException
import io

In [18]:
from modal import App, Volume, Image
from pathlib import Path
import boto3, shutil, os

# Modal app
app = App("sr-hybrid-upload-app")

# Persistent Modal volume
volume = Volume.from_name("sr-hybrid-app-volume", create_if_missing=True)

# Image with boto3 for S3 + optional baked-in dataset + baked-in model artifacts
image = (
    Image.debian_slim()
    .pip_install("boto3")
    .add_local_dir("../data", "/frozen_data")               # dataset (optional frozen)
    .add_local_dir("../model", "/frozen_model")        # model artifacts (optional frozen)
)


@app.function(volumes={"/data": volume}, image=image)
def upload_all(local_dirs: dict = None, s3_bucket: str = None, s3_prefix: str = None, use_frozen: bool = False):
    """
    Uploads training data + model artifacts into the Modal volume.

    - local_dirs: dict of {"remote_subdir": "local_path"} (dev mode)
    - s3_bucket + s3_prefix: fetch from S3 (prod mode)
    - use_frozen: copy pre-baked datasets + models
    """
    os.makedirs("/data", exist_ok=True)

    if local_dirs:
        # Dev mode
        for subdir, local_path in local_dirs.items():
            dest_dir = Path(f"/data/{subdir}")
            dest_dir.mkdir(parents=True, exist_ok=True)

            for file in Path(local_path).glob("*"):
                if file.is_file():
                    shutil.copy(file, dest_dir / file.name)
                    print(f"[DEV] Copied {file} -> {dest_dir / file.name}")

    elif s3_bucket and s3_prefix:
        # Prod mode (fetching from S3)
        s3 = boto3.client("s3")
        result = s3.list_objects_v2(Bucket=s3_bucket, Prefix=s3_prefix)

        for obj in result.get("Contents", []):
            key = obj["Key"]
            filename = os.path.basename(key)
            subdir = os.path.dirname(key).split("/")[-1]  # e.g. "model" or "data"
            dest_dir = Path(f"/data/{subdir}")
            dest_dir.mkdir(parents=True, exist_ok=True)

            dest = dest_dir / filename
            s3.download_file(s3_bucket, key, str(dest))
            print(f"[PROD] Downloaded s3://{s3_bucket}/{key} -> {dest}")

    elif use_frozen:
        # Frozen mode (both datasets + model artifacts baked in)
        for folder, frozen_path in [("data", "/frozen_data"), ("model", "/frozen_model")]:
            dest_dir = Path(f"/data/{folder}")
            dest_dir.mkdir(parents=True, exist_ok=True)

            for file in Path(frozen_path).glob("*"):
                if file.is_file():
                    shutil.copy(file, dest_dir / file.name)
                    print(f"[FROZEN] Copied {file} -> {dest_dir / file.name}")

    else:
        print("‚ö†Ô∏è No source provided. Pass local_dirs, or s3_bucket+s3_prefix, or use_frozen=True.")

    # Confirm
    print("\nFiles now in Modal volume:")
    for file in Path("/data").rglob("*"):
        print(f" - {file}")


# === Notebook/CLI helpers ===

def run_upload_local():
    """Upload local dataset + model artifacts (dev mode)."""
    local_dirs = {
        "data": "../data",                  # training data
        "model": "../model"            # model artifacts (pkl, json, etc.)
    }
    with app.run():
        upload_all.remote(local_dirs=local_dirs)


def run_upload_s3(bucket, prefix):
    """Upload from S3 (prod mode)."""
    with app.run():
        upload_all.remote(s3_bucket=bucket, s3_prefix=prefix)


def run_upload_frozen():
    """Upload frozen dataset + model artifacts (frozen mode)."""
    with app.run():
        upload_all.remote(use_frozen=True)


In [19]:
run_upload_frozen()

In [32]:
!ls ../model/

model.pkl  model_features.json


## Model Serving

In [33]:
import modal
import pandas as pd
import numpy as np
from fastapi import File, UploadFile, Form, HTTPException
import io
import json
import pickle
from pathlib import Path

# Create app definition
app = modal.App("sr-hybrid-sales-api")

# Define base image with all dependencies
base_image = (modal.Image.debian_slim()
        .pip_install("pydantic==1.10.8")        
        .pip_install("fastapi==0.95.2")         
        .pip_install("uvicorn==0.22.0")         
        .pip_install([                         
            "xgboost==1.7.6",
            "scikit-learn==1.3.1",
            "pandas",
            "numpy",
        ]))

# Create volume to access data (changed volume name)
data_volume = modal.Volume.from_name("sr-hybrid-app-volume")

# Simple health endpoint
@app.function(image=base_image)
@modal.fastapi_endpoint(method="GET")
def health():
   """Health check endpoint to verify the API is running"""
   return {"status": "healthy", "service": "sr-hybrid-sales-api"}

# Function to load the existing model
@app.function(image=base_image, volumes={"/data": data_volume})
def serve_model():
   """Load the existing trained model from pickle"""
   import pickle
   import os
   
   model_path = "/data/model/model.pkl"
   
   try:
       if os.path.exists(model_path):
           print(f"Loading existing model from {model_path}")
           with open(model_path, 'rb') as f:
               model = pickle.load(f)
           print("Model loaded successfully!")
           return model
       else:
           raise FileNotFoundError(f"Model file not found at {model_path}")
           
   except Exception as e:
       import traceback
       print(f"Error loading model: {str(e)}")
       print(traceback.format_exc())
       raise

# Function to load model features
@app.function(image=base_image, volumes={"/data": data_volume})
def load_model_features():
   """Load the expected model features from JSON"""
   import json
   
   features_path = "/data/model/model_features.json"
   
   try:
       if os.path.exists(features_path):
           with open(features_path, 'r') as f:
               features = json.load(f)
           print(f"Loaded {len(features)} expected features")
           return features
       else:
           raise FileNotFoundError(f"Features file not found at {features_path}")
           
   except Exception as e:
       import traceback
       print(f"Error loading features: {str(e)}")
       print(traceback.format_exc())
       raise

# CSV upload endpoint with new preprocessing pipeline
@app.function(image=base_image, volumes={"/data": data_volume})
@modal.fastapi_endpoint(method="POST")
async def predict_csv(file: UploadFile = File(...)):
    """API endpoint for batch predictions from a CSV file using existing model"""
    import pandas as pd
    import io
    import pickle
    import os
    import traceback
    from pathlib import Path
    
    try:
        # Load the pre-trained model
        model = serve_model.remote()
        print("Model loaded successfully")
        
        # Load expected features
        expected_features = load_model_features.remote()
        print(f"Expected features loaded: {len(expected_features)} features")
        
        # Read uploaded CSV file content
        contents = await file.read()
        
        # Parse CSV data
        try:
            test_df = pd.read_csv(io.BytesIO(contents))
            print(f"Test data shape: {test_df.shape}")
            print(f"Test data columns: {test_df.columns.tolist()}")
        except Exception as e:
            return {
                "success": False,
                "error": f"Failed to parse uploaded CSV: {str(e)}"
            }
        
        # Load demographic data for merging
        demographics_path = "/data/data/zipcode_demographics.csv"
        if os.path.exists(demographics_path):
            demo_df = pd.read_csv(demographics_path)
            print(f"Demographics data shape: {demo_df.shape}")
            
            # Merge test data with demographics
            # Assuming 'zipcode' is the common column - adjust if different
            if 'zipcode' in test_df.columns and 'zipcode' in demo_df.columns:
                merged_df = pd.merge(test_df, demo_df, on='zipcode', how='left')
                print(f"Merged data shape: {merged_df.shape}")
            else:
                print("Warning: zipcode column not found, using test data as-is")
                merged_df = test_df.copy()
        else:
            print("Warning: Demographics file not found, using test data as-is")
            merged_df = test_df.copy()
        
        # Select only the expected features
        available_features = [col for col in expected_features if col in merged_df.columns]
        missing_features = [col for col in expected_features if col not in merged_df.columns]
        
        if missing_features:
            print(f"Warning: Missing features: {missing_features}")
            # Add missing features with default values (0 or mean/median)
            for feature in missing_features:
                merged_df[feature] = 0  # or use a more sophisticated default
        
        # Select the exact features expected by the model
        feature_df = merged_df[expected_features].copy()
        print(f"Final feature matrix shape: {feature_df.shape}")
        
        # Handle any remaining missing values
        feature_df = feature_df.fillna(0)  # or use more sophisticated imputation
        
        # Make predictions
        predictions = model.predict(feature_df)
        print(f"Predictions shape: {predictions.shape}")
        
        # Return predictions as a list
        return predictions.tolist()
            
    except Exception as e:
        import traceback
        return {
            "success": False,
            "error": f"Error processing CSV: {str(e)}",
            "traceback": traceback.format_exc()
        }

@app.local_entrypoint()
def main():
   """Local entrypoint for testing the API"""
   print("Starting sr-hybrid-sales-api...")
   
   # Pre-load the model to ensure it exists
   print("Preparing model...")
   serve_model.remote()
   print("Model preparation complete!")
   
   print("\nAPI is ready for use at:")
   print("- Health check: https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run")
   print("- CSV predictions: https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run")

In [None]:
# Deploy the app
app.deploy()

In [38]:
import requests
import pandas as pd
import json
from pathlib import Path
import time

class ModalAPITester:
    def __init__(self, base_url=None):
        """
        Initialize the tester with base URL
        You'll need to update these URLs after deployment
        """
        if base_url:
            self.health_url = f"{https://flexible-functions-ai--sr-hybrid-sales-api}-health.modal.run"
            self.predict_url = f"{https://flexible-functions-ai--sr-hybrid-sales-api}-predict-csv.modal.run"
        else:
            # You'll need to replace these with your actual deployed URLs
            self.health_url = "https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run"
            self.predict_url = "{https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run"
    
    def test_health_endpoint(self):
        """Test the health check endpoint"""
        print("=" * 50)
        print("Testing Health Endpoint")
        print("=" * 50)
        
        try:
            response = requests.get(self.health_url, timeout=10)
            
            print(f"Status Code: {response.status_code}")
            print(f"Response: {response.json()}")
            
            if response.status_code == 200:
                print("‚úÖ Health endpoint is working!")
                return True
            else:
                print("‚ùå Health endpoint failed!")
                return False
                
        except requests.exceptions.RequestException as e:
            print(f"‚ùå Error connecting to health endpoint: {str(e)}")
            print("Make sure your Modal app is deployed and the URL is correct")
            return False
    
    def test_predict_endpoint(self, csv_file_path=None):
        """Test the CSV prediction endpoint"""
        print("\n" + "=" * 50)
        print("Testing Prediction Endpoint")
        print("=" * 50)
        
        # Default test data
        if csv_file_path is None:
            csv_file_path = self.create_test_csv()
        
        try:
            # Check if file exists
            if not Path(csv_file_path).exists():
                print(f"‚ùå Test file not found: {csv_file_path}")
                return False
            
            print(f"Using test file: {csv_file_path}")
            
            # Read and display file info
            test_df = pd.read_csv(csv_file_path)
            print(f"Test data shape: {test_df.shape}")
            print(f"Test data columns: {test_df.columns.tolist()}")
            print(f"First few rows:\n{test_df.head()}")
            
            # Prepare the file for upload
            with open(csv_file_path, 'rb') as f:
                files = {'file': ('test_data.csv', f, 'text/csv')}
                
                print(f"\nSending request to: {self.predict_url}")
                print("This might take a moment...")
                
                # Make the request
                response = requests.post(
                    self.predict_url, 
                    files=files,
                    timeout=60  # Increase timeout for model inference
                )
            
            print(f"Status Code: {response.status_code}")
            
            if response.status_code == 200:
                try:
                    result = response.json()
                    
                    if isinstance(result, list):
                        # Direct predictions
                        predictions = result
                        print(f"‚úÖ Predictions received!")
                        print(f"Number of predictions: {len(predictions)}")
                        print(f"First 5 predictions: {predictions[:5]}")
                        print(f"Prediction range: {min(predictions):.2f} to {max(predictions):.2f}")
                        return True
                        
                    elif isinstance(result, dict) and result.get('success') == False:
                        # Error response
                        print(f"‚ùå API returned error: {result.get('error')}")
                        if 'traceback' in result:
                            print(f"Traceback: {result['traceback']}")
                        return False
                        
                    else:
                        # Structured response
                        print(f"Response: {result}")
                        return True
                        
                except json.JSONDecodeError:
                    print(f"‚ùå Could not parse JSON response: {response.text}")
                    return False
            else:
                print(f"‚ùå Request failed with status {response.status_code}")
                print(f"Response: {response.text}")
                return False
                
        except requests.exceptions.Timeout:
            print("‚ùå Request timed out. The model might be taking too long to respond.")
            return False
        except requests.exceptions.RequestException as e:
            print(f"‚ùå Error making request: {str(e)}")
            return False
    
    def create_test_csv(self):
        """Create a simple test CSV if none provided"""
        test_data = {
            'bedrooms': [3, 4, 2, 5],
            'bathrooms': [2.0, 3.0, 1.5, 2.5],
            'sqft_living': [1500, 2000, 1200, 2500],
            'sqft_lot': [5000, 6000, 4000, 7000],
            'floors': [1, 2, 1, 2],
            'zipcode': [98001, 98002, 98003, 98004]  # Assuming these exist in demographics
        }
        
        test_df = pd.DataFrame(test_data)
        test_file = 'test_predictions.csv'
        test_df.to_csv(test_file, index=False)
        
        print(f"Created test CSV: {test_file}")
        return test_file
    
    def run_full_test(self, csv_file_path=None):
        """Run complete test suite"""
        print("üöÄ Starting Modal API Tests")
        print(f"Health URL: {self.health_url}")
        print(f"Predict URL: {self.predict_url}")
        
        # Test 1: Health check
        health_passed = self.test_health_endpoint()
        
        if not health_passed:
            print("\n‚ùå Health check failed. Skipping prediction test.")
            return False
        
        # Test 2: Prediction endpoint
        predict_passed = self.test_predict_endpoint(csv_file_path)
        
        # Summary
        print("\n" + "=" * 50)
        print("TEST SUMMARY")
        print("=" * 50)
        print(f"Health Endpoint: {'‚úÖ PASS' if health_passed else '‚ùå FAIL'}")
        print(f"Predict Endpoint: {'‚úÖ PASS' if predict_passed else '‚ùå FAIL'}")
        
        if health_passed and predict_passed:
            print("\nüéâ All tests passed! Your API is working correctly.")
            return True
        else:
            print("\nüòû Some tests failed. Check the errors above.")
            return False

# Usage examples:
def test_with_default_data():
    """Test with automatically generated data"""
    tester = ModalAPITester()
    # Update URLs after deployment
    tester.health_url = "https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run/"
    tester.predict_url = "https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run"
    
    return tester.run_full_test()

def test_with_your_data():
    """Test with your actual test data"""
    tester = ModalAPITester()
    # Update URLs after deployment
    tester.health_url = "https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run/"
    tester.predict_url = "https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run"
    
    # Use your actual test file
    csv_file = "../data/future_unseen_examples.csv"  # Adjust path as needed
    return tester.run_full_test(csv_file)

def quick_test():
    """Quick test function for notebook use"""
    # Replace these URLs with your actual deployment URLs
    health_url = "https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run/"
    predict_url = "https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run"
    
    tester = ModalAPITester()
    tester.health_url = health_url
    tester.predict_url = predict_url
    
    return tester.run_full_test()

if __name__ == "__main__":
    # Run the test
    print("Modal API Tester")
    print("Remember to update the URLs with your actual deployed endpoints!")
    
    # Create tester instance
    tester = ModalAPITester()
    
    # You MUST update these URLs after deployment
    print("\n‚ö†Ô∏è  IMPORTANT: Update these URLs with your actual deployment URLs:")
    print(f"Current health URL: {tester.health_url}")
    print(f"Current predict URL: {tester.predict_url}")
    
    # Uncomment to run with default test data:
    #test_with_default_data()
    
    # Uncomment to run with your actual data:
    # test_with_your_data()

Modal API Tester
Remember to update the URLs with your actual deployed endpoints!

‚ö†Ô∏è  IMPORTANT: Update these URLs with your actual deployment URLs:
Current health URL: https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run
Current predict URL: {https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run


In [39]:
test_with_your_data()

üöÄ Starting Modal API Tests
Health URL: https://flexible-functions-ai--sr-hybrid-sales-api-health.modal.run/
Predict URL: https://flexible-functions-ai--sr-hybrid-sales-api-predict-csv.modal.run
Testing Health Endpoint
‚ùå Error connecting to health endpoint: HTTPSConnectionPool(host='flexible-functions-ai--sr-hybrid-sales-api-health.modal.run', port=443): Read timed out. (read timeout=10)
Make sure your Modal app is deployed and the URL is correct

‚ùå Health check failed. Skipping prediction test.


False

In [22]:
from modal import App, Volume, Image
from fastapi import UploadFile, File, HTTPException
import io, json, joblib, pandas as pd
from pathlib import Path

# === Modal setup ===
app = App("sr-house-price-api")

volume = Volume.from_name("sr-hybrid-app-volume")

image = (
    Image.debian_slim()
    .pip_install(
        "fastapi",
        "uvicorn",
        "scikit-learn",
        "pandas",
        "numpy",
        "joblib",
        "bentoml",          # if you want Bento validation
        "dash",             # if you plan to expose a Dash UI alongside
    )
)

# === Globals cached in container ===
MODEL = None
FEATURES = None
DEMOGRAPHICS = None


def _load_artifacts():
    """Load model, features, demographics into globals (once per container)."""
    global MODEL, FEATURES, DEMOGRAPHICS

    if MODEL is not None:
        return  # already loaded

    model_path = Path("/data/model/model.pkl")
    features_path = Path("/data/model/model_features.json")
    demo_path = Path("/data/data/demographics.csv")

    if not model_path.exists() or not features_path.exists():
        raise RuntimeError("Model artifacts missing in Modal volume (/data/model/*).")

    MODEL = joblib.load(model_path)
    with open(features_path) as f:
        FEATURES = json.load(f)

    if not demo_path.exists():
        raise RuntimeError("Demographics file missing in /data/data/demographics.csv")
    DEMOGRAPHICS = pd.read_csv(demo_path)


def _prepare_feature_vector(zipcode: str, extra_inputs: dict):
    """Return a feature vector aligned with model features."""
    row = DEMOGRAPHICS.loc[DEMOGRAPHICS["zipcode"] == str(zipcode)]
    if row.empty:
        raise ValueError(f"Zipcode {zipcode} not found in demographics.")
    demo_row = row.iloc[0].to_dict()

    merged = {**demo_row, **(extra_inputs or {})}

    vector = []
    for feat in FEATURES:
        if feat not in merged:
            raise ValueError(f"Missing required feature: {feat}")
        vector.append(merged[feat])
    return vector


# === Endpoints ===

@app.function(image=image, volumes={"/data": volume})
@modal.fastapi_endpoint(method="GET")
def health():
    """Health check endpoint"""
    return {"status": "healthy", "service": "sr-house-price-api"}


@app.function(image=image, volumes={"/data": volume})
@modal.fastapi_endpoint(method="POST")
async def predict_single(request: dict):
    """
    Single prediction endpoint.
    Example input:
    {
        "zipcode": "98109",
        "extra_inputs": {"sqft": 1500, "num_bedrooms": 3}
    }
    """
    try:
        _load_artifacts()
        zipcode = request.get("zipcode")
        if not zipcode:
            raise ValueError("zipcode is required")

        vector = _prepare_feature_vector(zipcode, request.get("extra_inputs"))
        pred = MODEL.predict([vector])[0]
        return {"prediction": float(pred)}

    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))


@app.function(image=image, volumes={"/data": volume})
@modal.fastapi_endpoint(method="POST")
async def predict_batch(file: UploadFile = File(...)):
    """
    Batch prediction endpoint.
    CSV must include 'zipcode' column plus any extra input features.
    """
    try:
        _load_artifacts()

        contents = await file.read()
        df = pd.read_csv(io.BytesIO(contents))

        if "zipcode" not in df.columns:
            raise ValueError("CSV must include 'zipcode' column")

        preds = []
        for _, row in df.iterrows():
            try:
                vector = _prepare_feature_vector(
                    row["zipcode"], row.drop("zipcode").to_dict()
                )
                pred = MODEL.predict([vector])[0]
                preds.append({"zipcode": row["zipcode"], "prediction": float(pred)})
            except Exception as inner_e:
                preds.append({"zipcode": row["zipcode"], "error": str(inner_e)})

        return preds

    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))


# === Local entrypoint for testing ===
@app.local_entrypoint()
def main():
    print("üè° House Price Prediction API")
    print("Health check: https://<your-app-name>--sr-house-price-api-health.modal.run")
    print("Single prediction: https://<your-app-name>--sr-house-price-api-predict-single.modal.run")
    print("Batch prediction: https://<your-app-name>--sr-house-price-api-predict-batch.modal.run")


In [23]:
app.run()

<contextlib.Blocking_AsyncGeneratorContextManager at 0x7f17bb91c920>

In [24]:
app.deploy()

<modal.app.App at 0x7f17b2d39790>

In [28]:
"""
Real Estate Price Prediction API Service
Serves the phData ML model with backend demographic data integration
"""

from modal import App, Volume, Image
from pathlib import Path
import pandas as pd
import numpy as np
import pickle
import json
import os
from typing import Dict, List, Optional, Union
from fastapi import HTTPException
from pydantic import BaseModel, Field, validator
from datetime import datetime

# =====================================================
# Modal App Configuration
# =====================================================
app = App("real-estate-prediction-api")

# Create image with required dependencies
base_image = (
    Image.debian_slim()
    .pip_install([
        "pandas==2.0.3",
        "numpy==1.24.3",
        "scikit-learn==1.3.1",
        "xgboost==1.7.6",
        "fastapi==0.95.2",
        "pydantic==1.10.8",
        "uvicorn==0.22.0",
    ])
)

# Volume for model and data persistence
volume = Volume.from_name("sr-hybrid-app-volume", create_if_missing=True)

# =====================================================
# Data Models / Request-Response Schemas
# =====================================================

class HouseFeatures(BaseModel):
    """Schema for single house prediction request"""
    bedrooms: int = Field(..., ge=0, le=15, description="Number of bedrooms")
    bathrooms: float = Field(..., ge=0, le=10, description="Number of bathrooms")
    sqft_living: int = Field(..., gt=0, description="Square feet of living space")
    sqft_lot: int = Field(..., gt=0, description="Square feet of lot")
    floors: float = Field(..., ge=1, le=4, description="Number of floors")
    waterfront: int = Field(..., ge=0, le=1, description="Waterfront property (0/1)")
    view: int = Field(..., ge=0, le=4, description="View quality (0-4)")
    condition: int = Field(..., ge=1, le=5, description="Condition of house (1-5)")
    grade: int = Field(..., ge=1, le=13, description="Grade of house (1-13)")
    sqft_above: int = Field(..., ge=0, description="Square feet above ground")
    sqft_basement: int = Field(..., ge=0, description="Square feet of basement")
    yr_built: int = Field(..., ge=1900, le=2025, description="Year built")
    yr_renovated: int = Field(..., ge=0, le=2025, description="Year renovated (0 if never)")
    zipcode: int = Field(..., description="Zipcode")
    lat: float = Field(..., ge=47.0, le=48.0, description="Latitude")
    long: float = Field(..., ge=-123.0, le=-121.0, description="Longitude")
    sqft_living15: int = Field(..., gt=0, description="Living space of nearest 15 neighbors")
    sqft_lot15: int = Field(..., gt=0, description="Lot size of nearest 15 neighbors")
    
    @validator('yr_renovated')
    def validate_renovation_year(cls, v, values):
        """Ensure renovation year is after build year if renovated"""
        if v > 0 and 'yr_built' in values and v < values['yr_built']:
            raise ValueError('Renovation year must be after build year')
        return v

class BatchPredictionRequest(BaseModel):
    """Schema for batch prediction request"""
    houses: List[HouseFeatures]

class PredictionResponse(BaseModel):
    """Schema for single prediction response"""
    predicted_price: float
    confidence_interval: Optional[Dict[str, float]] = None
    zipcode: int
    has_demographics: bool
    model_version: str
    timestamp: str

class BatchPredictionResponse(BaseModel):
    """Schema for batch prediction response"""
    predictions: List[PredictionResponse]
    total_houses: int
    houses_with_demographics: int
    model_version: str
    timestamp: str

class ModelInfoResponse(BaseModel):
    """Schema for model info response"""
    model_version: str
    expected_features: Dict[str, List[str]]
    total_features_required: int
    available_zipcodes: int
    last_updated: str

# =====================================================
# Model and Data Management Class
# =====================================================

class ModelService:
    """Handles model loading, data preprocessing, and predictions"""
    
    def __init__(self):
        self.model = None
        self.demographics_df = None
        self.model_features = None
        self.model_version = "v1.0.0"
        self.model_expected = [
            'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
            'sqft_above', 'sqft_basement',
            'ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty',
            'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt',
            'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty', 'edctn_some_clg_qty',
            'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty', 'edctn_prfsnl_qty',
            'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm',
            'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg',
            'per_assoc', 'per_bchlr', 'per_prfsnl'
        ]
        
    def load_model(self, model_path: str = "/data/model/model.pkl") -> bool:
        """Load the trained model from pickle file"""
        try:
            if os.path.exists(model_path):
                print(f"Loading model from {model_path}")
                with open(model_path, 'rb') as f:
                    self.model = pickle.load(f)
                
                # Try to load model features if available
                features_path = model_path.replace('model.pkl', 'model_features.json')
                if os.path.exists(features_path):
                    with open(features_path, 'r') as f:
                        self.model_features = json.load(f)
                
                print("Model loaded successfully")
                return True
            else:
                print(f"Model file not found at {model_path}")
                return False
        except Exception as e:
            print(f"Error loading model: {str(e)}")
            return False
    
    def load_demographics(self, demographics_path: str = "/data/data/zipcode_demographics.csv") -> bool:
        """Load and cache demographics data"""
        try:
            if os.path.exists(demographics_path):
                print(f"Loading demographics from {demographics_path}")
                self.demographics_df = pd.read_csv(demographics_path)
                # Convert zipcode to string to handle potential leading zeros
                self.demographics_df['zipcode'] = self.demographics_df['zipcode'].astype(str)
                
                # Create a lookup dictionary for faster access
                self.demographics_lookup = self.demographics_df.set_index('zipcode').to_dict('index')
                
                print(f"Demographics loaded: {len(self.demographics_df)} zipcodes")
                return True
            else:
                print(f"Demographics file not found at {demographics_path}")
                return False
        except Exception as e:
            print(f"Error loading demographics: {str(e)}")
            return False
    
    def merge_with_demographics(self, house_df: pd.DataFrame) -> pd.DataFrame:
        """Merge house features with demographic data"""
        # Convert zipcode to string for matching
        house_df['zipcode'] = house_df['zipcode'].astype(str)
        
        # Merge with demographics
        merged_df = house_df.merge(self.demographics_df, on='zipcode', how='left')
        
        # Handle missing demographics (zipcodes not in demographics data)
        if merged_df.isnull().any().any():
            # Fill missing demographic features with median values
            demographic_cols = [col for col in self.demographics_df.columns if col != 'zipcode']
            for col in demographic_cols:
                if col in merged_df.columns:
                    median_val = self.demographics_df[col].median()
                    merged_df[col].fillna(median_val, inplace=True)
                    
        return merged_df
    
    def prepare_features(self, merged_df: pd.DataFrame) -> pd.DataFrame:
        """Select and order features as expected by the model"""
        # Select only the features the model expects
        inference_df = merged_df[self.model_expected].copy()
        
        # Ensure all features are numeric
        for col in inference_df.columns:
            inference_df[col] = pd.to_numeric(inference_df[col], errors='coerce')
        
        # Handle any remaining NaN values
        inference_df.fillna(0, inplace=True)
        
        return inference_df
    
    def predict_single(self, house_features: dict) -> dict:
        """Make prediction for a single house"""
        # Convert to DataFrame
        house_df = pd.DataFrame([house_features])
        
        # Store original zipcode for response
        original_zipcode = house_features['zipcode']
        
        # Merge with demographics
        merged_df = self.merge_with_demographics(house_df)
        
        # Check if demographics were found
        has_demographics = str(original_zipcode) in self.demographics_lookup
        
        # Prepare features
        inference_df = self.prepare_features(merged_df)
        
        # Make prediction
        prediction = self.model.predict(inference_df)[0]
        
        # Calculate confidence interval (simplified - you can use model-specific methods)
        # For XGBoost, you might use prediction intervals or quantile regression
        confidence_interval = {
            "lower_bound": float(prediction * 0.9),  # Simplified 10% interval
            "upper_bound": float(prediction * 1.1)
        }
        
        return {
            "predicted_price": float(prediction),
            "confidence_interval": confidence_interval,
            "zipcode": original_zipcode,
            "has_demographics": has_demographics,
            "model_version": self.model_version,
            "timestamp": datetime.utcnow().isoformat()
        }
    
    def predict_batch(self, houses: List[dict]) -> dict:
        """Make predictions for multiple houses"""
        # Convert to DataFrame
        houses_df = pd.DataFrame(houses)
        
        # Store original zipcodes
        original_zipcodes = houses_df['zipcode'].tolist()
        
        # Merge with demographics
        merged_df = self.merge_with_demographics(houses_df)
        
        # Check demographics availability
        houses_with_demographics = sum(
            1 for zipcode in original_zipcodes 
            if str(zipcode) in self.demographics_lookup
        )
        
        # Prepare features
        inference_df = self.prepare_features(merged_df)
        
        # Make predictions
        predictions = self.model.predict(inference_df)
        
        # Build response
        prediction_responses = []
        for i, (prediction, zipcode) in enumerate(zip(predictions, original_zipcodes)):
            prediction_responses.append({
                "predicted_price": float(prediction),
                "confidence_interval": {
                    "lower_bound": float(prediction * 0.9),
                    "upper_bound": float(prediction * 1.1)
                },
                "zipcode": zipcode,
                "has_demographics": str(zipcode) in self.demographics_lookup,
                "model_version": self.model_version,
                "timestamp": datetime.utcnow().isoformat()
            })
        
        return {
            "predictions": prediction_responses,
            "total_houses": len(houses),
            "houses_with_demographics": houses_with_demographics,
            "model_version": self.model_version,
            "timestamp": datetime.utcnow().isoformat()
        }

# =====================================================
# API Endpoints
# =====================================================

# Initialize model service (will be done once when container starts)
model_service = ModelService()

@app.function(
    image=base_image,
    volumes={"/data": volume},
    container_idle_timeout=300,  # Keep warm for 5 minutes
    memory=2048,  # 2GB memory
)
@modal.fastapi_endpoint(method="GET")
def health():
    """Health check endpoint"""
    return {
        "status": "healthy",
        "service": "real-estate-prediction-api",
        "model_loaded": model_service.model is not None,
        "demographics_loaded": model_service.demographics_df is not None,
        "timestamp": datetime.utcnow().isoformat()
    }

@app.function(
    image=base_image,
    volumes={"/data": volume},
    container_idle_timeout=300,
    memory=2048,
)
@modal.fastapi_endpoint(method="POST")
def predict(house: HouseFeatures):
    """Single house price prediction endpoint"""
    # Initialize model if not loaded
    if model_service.model is None:
        if not model_service.load_model():
            raise HTTPException(status_code=500, detail="Model not available")
    
    if model_service.demographics_df is None:
        if not model_service.load_demographics():
            raise HTTPException(status_code=500, detail="Demographics data not available")
    
    try:
        # Convert Pydantic model to dict
        house_dict = house.dict()
        
        # Make prediction
        result = model_service.predict_single(house_dict)
        
        return PredictionResponse(**result)
        
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Prediction error: {str(e)}")

@app.function(
    image=base_image,
    volumes={"/data": volume},
    container_idle_timeout=300,
    memory=2048,
)
@modal.fastapi_endpoint(method="POST")
def predict_batch(request: BatchPredictionRequest):
    """Batch prediction endpoint for multiple houses"""
    # Initialize model if not loaded
    if model_service.model is None:
        if not model_service.load_model():
            raise HTTPException(status_code=500, detail="Model not available")
    
    if model_service.demographics_df is None:
        if not model_service.load_demographics():
            raise HTTPException(status_code=500, detail="Demographics data not available")
    
    try:
        # Convert Pydantic models to dicts
        houses_dicts = [house.dict() for house in request.houses]
        
        # Make predictions
        result = model_service.predict_batch(houses_dicts)
        
        return BatchPredictionResponse(**result)
        
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Batch prediction error: {str(e)}")

@app.function(
    image=base_image,
    volumes={"/data": volume},
    container_idle_timeout=300,
    memory=2048,
)
@modal.fastapi_endpoint(method="GET")
def model_info():
    """Get model information and expected features"""
    # Initialize if needed
    if model_service.demographics_df is None:
        model_service.load_demographics()
    
    house_features = [
        'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
        'waterfront', 'view', 'condition', 'grade', 'sqft_above', 
        'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 
        'lat', 'long', 'sqft_living15', 'sqft_lot15'
    ]
    
    demographic_features = [
        'ppltn_qty', 'urbn_ppltn_qty', 'sbrbn_ppltn_qty', 'farm_ppltn_qty', 'non_farm_qty',
        'medn_hshld_incm_amt', 'medn_incm_per_prsn_amt', 'hous_val_amt',
        'edctn_less_than_9_qty', 'edctn_9_12_qty', 'edctn_high_schl_qty', 
        'edctn_some_clg_qty', 'edctn_assoc_dgre_qty', 'edctn_bchlr_dgre_qty', 
        'edctn_prfsnl_qty', 'per_urbn', 'per_sbrbn', 'per_farm', 'per_non_farm',
        'per_less_than_9', 'per_9_to_12', 'per_hsd', 'per_some_clg',
        'per_assoc', 'per_bchlr', 'per_prfsnl'
    ]
    
    return ModelInfoResponse(
        model_version=model_service.model_version,
        expected_features={
            "input_features": house_features,
            "demographic_features": demographic_features,
            "model_features": model_service.model_expected
        },
        total_features_required=len(model_service.model_expected),
        available_zipcodes=len(model_service.demographics_df) if model_service.demographics_df is not None else 0,
        last_updated=datetime.utcnow().isoformat()
    )

@app.function(
    image=base_image,
    volumes={"/data": volume},
    container_idle_timeout=300,
    memory=2048,
)
@modal.fastapi_endpoint(method="GET")
def available_zipcodes():
    """Get list of zipcodes with demographic data available"""
    if model_service.demographics_df is None:
        if not model_service.load_demographics():
            raise HTTPException(status_code=500, detail="Demographics data not available")
    
    zipcodes = model_service.demographics_df['zipcode'].unique().tolist()
    return {
        "zipcodes": sorted(zipcodes),
        "total": len(zipcodes)
    }

# =====================================================
# Local Entry Point for Deployment
# =====================================================

@app.local_entrypoint()
def main():
    """Deploy the API and pre-load model/data"""
    print("="*60)
    print("Deploying Real Estate Prediction API...")
    print("="*60)
    
    # Pre-warm the service by loading model and data
    print("\nPre-loading model and data...")
    
    # Trigger model loading
    with app.run():
        health_status = health.remote()
        print(f"\nHealth check: {health_status}")
    
    print("\n" + "="*60)
    print("API Deployment Complete!")
    print("="*60)
    print("\nEndpoints available:")
    print("- Health Check: GET  /health")
    print("- Single Prediction: POST /predict")
    print("- Batch Prediction: POST /predict_batch")
    print("- Model Info: GET /model_info")
    print("- Available Zipcodes: GET /available_zipcodes")
    print("\nYour API endpoints will be available at:")
    print("https://[your-username]--real-estate-prediction-api-[endpoint].modal.run")
    print("="*60)

if __name__ == "__main__":
    main()

/tmp/ipykernel_105302/2105608628.py:65: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  @validator('yr_renovated')
/tmp/ipykernel_105302/2105608628.py:291: DeprecationError: 2025-02-24: We have renamed several parameters related to autoscaling. Please update your code to use the following new names:

- container_idle_timeout -> scaledown_window

See https://modal.com/docs/guide/modal-1-0-migration for more details.
  @app.function(
/tmp/ipykernel_105302/2105608628.py:308: DeprecationError: 2025-02-24: We have renamed several parameters related to autoscaling. Please update your code to use the following new names:

- container_idle_timeout -> scaledown_window

See https://modal.com/docs/guide/modal-1-0-migrat

Deploying Real Estate Prediction API...

Pre-loading model and data...


RemoteError: Image build for im-GhsUxCwkXHCE5fo6Fd59Bg failed with the exception:


In [29]:
# Create app definition
app = modal.App("sr-hybrid-upload-app")

# Define base image with all dependencies
base_image = (modal.Image.debian_slim()
        .pip_install("fastapi","uvicorn","scikit-learn","pandas","numpy","joblib","dash"))

# Create the fastai image by extending the base image
fastai_image = (base_image
               .pip_install(["fastai", "torch"]))

# Create volume to access data
data_volume = modal.Volume.from_name("sr-hybrid-app-volume")

# Simple health endpoint
@app.function(image=base_image)
@modal.fastapi_endpoint(method="GET")
def health():
   """Health check endpoint to verify the API is running"""
   return {"status": "healthy", "service": "sr-hybrid-upload-api"}

# Function to load or train a model
@app.function(image=fastai_image, volumes={"/data": data_volume})
def serve_model():
   """Load or train an XGBoost model"""
   import xgboost as xgb
   from fastai.tabular.all import add_datepart, TabularPandas, cont_cat_split
   from fastai.tabular.all import Categorify, FillMissing, Normalize, CategoryBlock, RandomSplitter, range_of
   from pathlib import Path
   import pickle
   import os
   import bentoml
   
   # Model tag used in train.py
   model_tag = "sr_v1"
   
   # Create a path to save the model for future use
   model_path = "./model/model.pkl"
   
   try:
       #
       # Second attempt: Try loading from pickle
       if os.path.exists(model_path):
           print(f"Loading existing model from pickle at {model_path}")
           with open(model_path, 'rb') as f:
               model = pickle.load(f)
           return model
       
       
   except Exception as e:
       import traceback
       print(f"Error loading/training model: {str(e)}")
       print(traceback.format_exc())
       raise


## new predict_csv function with preprocessing step
# CSV upload endpoint with optimized preprocessing
# CSV upload endpoint with optimized preprocessing
# CSV upload endpoint - with debugging info (commented out)
@app.function(image=fastai_image, volumes={"/data": data_volume})
@modal.fastapi_endpoint(method="POST")
async def predict_csv(file: UploadFile = File(...)):
    """API endpoint for batch predictions from a CSV file using cached preprocessing"""
    import xgboost as xgb
    import io
    import pickle
    import os
    import traceback
    from fastai.tabular.all import add_datepart, TabularPandas, cont_cat_split
    from fastai.tabular.all import Categorify, FillMissing, Normalize, CategoryBlock, RandomSplitter, range_of
    from pathlib import Path
    
    # Uncomment for debugging
    # response_data = {"success": False, "debug_info": {}}
    
    try:
        # Debug information
        # response_data["debug_info"]["step"] = "Starting prediction process"
        
        # First, load or train model
        model = serve_model.remote()
        # response_data["debug_info"]["model_loaded"] = True
        
        # Read uploaded CSV file content
        contents = await file.read()
        
        # Parse CSV data
        try:
            test_df = pd.read_csv(io.BytesIO(contents))
            # response_data["debug_info"]["test_columns"] = test_df.columns.tolist()
            # response_data["debug_info"]["test_shape_before"] = test_df.shape
        except Exception as e:
            return {
                "success": False,
                "error": f"Failed to parse uploaded CSV: {str(e)}"
            }
        
        
        # Load the full training data to ensure proper preprocessing
        path = Path('/data/')
        test_df = pd.read_csv(path/'future_unseen_examples.csv', index_col='id')
        # Make predictions
        predictions = model.predict(test_df)
        
        # Return predictions in the format expected by test_modal_api.py
        return predictions.tolist()
        
        # To return structured response with debug info, use this instead:
        # response_data["success"] = True
        # response_data["predictions"] = predictions.tolist()
        # return response_data
            
    except Exception as e:
        import traceback
        return {
            "success": False,
            "error": f"Error processing CSV: {str(e)}",
            "traceback": traceback.format_exc()
        }


@app.local_entrypoint()
def main():
   """Local entrypoint for testing the API"""
   print("Starting sticker-sales-api...")
   
   # Pre-load the model to ensure it exists
   print("Preparing model...")
   serve_model.remote()
   print("Model preparation complete!")
   
   print("\nAPI is ready for use at:")
   print("- Health check: https://flexible-functions-ai--sticker-sales-api-health.modal.run")
   print("- CSV predictions: https://flexible-functions-ai--sticker-sales-api-predict-csv.modal.run")