# Housing value estimation model training

Let's train a simple regressor with scikit-learn and convert the fitted pipeline to ONNX format.

In [1]:
from pathlib import Path

import numpy as np
import onnxruntime as ort
import pandas as pd
import skl2onnx
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Load the French housing transactions dataset (DVF) for the Isère department in 2024:

In [2]:
dvf_38 = pd.read_csv(
    "https://files.data.gouv.fr/geo-dvf/latest/csv/2024/departements/38.csv.gz"
)
dvf_38.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58427 entries, 0 to 58426
Data columns (total 40 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id_mutation                   58427 non-null  object 
 1   date_mutation                 58427 non-null  object 
 2   numero_disposition            58427 non-null  int64  
 3   nature_mutation               58427 non-null  object 
 4   valeur_fonciere               57288 non-null  float64
 5   adresse_numero                36666 non-null  float64
 6   adresse_suffixe               2141 non-null   object 
 7   adresse_nom_voie              58056 non-null  object 
 8   adresse_code_voie             58056 non-null  object 
 9   code_postal                   58052 non-null  float64
 10  code_commune                  58427 non-null  int64  
 11  nom_commune                   58427 non-null  object 
 12  code_departement              58427 non-null  int64  
 13  a


Prepare the dataset to keep only sales of apartments in Grenoble:

In [3]:
dataset = dvf_38.copy()
dataset = dataset[
    (dataset.nature_mutation == "Vente")
    & (dataset.type_local == "Appartement")
    & (dataset.nom_commune == "Grenoble")
]
dataset = dataset[
    [
        "surface_reelle_bati",
        "nombre_pieces_principales",
        "latitude",
        "longitude",
        "valeur_fonciere",
    ]
]
dataset = dataset.rename(
    columns={
        "surface_reelle_bati": "area",
        "nombre_pieces_principales": "rooms",
        "valeur_fonciere": "value",
    }
)
dataset = dataset.dropna()
dataset = dataset.reset_index()
dataset

|      | index | area | rooms | latitude  | longitude | value    |
|------|-------|------|-------|-----------|-----------|----------|
| 0    | 8     | 39.0 | 1.0   | 45.175786 | 5.711498  | 97900.0  |
| 1    | 18    | 76.0 | 3.0   | 45.175551 | 5.712367  | 125800.0 |
| 2    | 21    | 63.0 | 4.0   | 45.174950 | 5.714996  | 130000.0 |
| 3    | 22    | 60.0 | 3.0   | 45.183757 | 5.720816  | 195000.0 |
| 4    | 39    | 86.0 | 4.0   | 45.165827 | 5.724160  | 204800.0 |
| ...  | ...   | ...  | ...   | ...       | ...       | ...      |
| 2475 | 36122 | 10.0 | 1.0   | 45.192335 | 5.727845  | 254600.0 |
| 2476 | 36123 | 72.0 | 4.0   | 45.192335 | 5.727845  | 254600.0 |
| 2477 | 36151 | 62.0 | 2.0   | 45.191966 | 5.728394  | 129000.0 |
| 2478 | 36211 | 38.0 | 1.0   | 45.196528 | 5.737606  | 107500.0 |


Split the dataset into training and test sets:

In [4]:
X = dataset[["area", "rooms", "latitude", "longitude"]]
y = dataset["value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
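
The split above is unseeded, so each run draws a different partition and the score reported further down will vary from run to run. A minimal sketch, on synthetic data rather than the DVF dataset, of pinning the split with `random_state` (the seed value 42 and the `*_demo` arrays are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real features and target (illustrative values only).
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.arange(100, dtype=float)

# Passing the same random_state reproduces the same partition on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```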

Train a scikit-learn pipeline that chains a standardization step and a linear regression model:

In [5]:
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("regressor", LinearRegression()),
    ]
)
pipeline.fit(X_train, y_train)

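Pipeline

| Parameter       | Value                                 |
|-----------------|---------------------------------------|
| steps           | [('scaler', ...), ('regressor', ...)] |
| transform_input |                                       |
| memory          |                                       |
| verbose         | False                                 |

StandardScaler

| Parameter | Value |
|-----------|-------|
| copy      | True  |
| with_mean | True  |
| with_std  | True  |

LinearRegression

| Parameter     | Value |
|---------------|-------|
| fit_intercept | True  |
| copy_X        | True  |
| tol           | 1e-06 |
| n_jobs        |       |
| positive      | False |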
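
Once fitted, the pipeline amounts to a single affine function of the inputs: each feature is standardized with the scaler's learned mean and standard deviation, then the regressor applies its coefficients and intercept. A small sketch on synthetic data (the arrays and the `demo_` names are made up for illustration) that recomputes the predictions by hand from the fitted attributes `mean_`, `scale_`, `coef_` and `intercept_`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, unrelated to the housing dataset.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(50, 4))
demo_y = demo_X @ np.array([3.0, -1.0, 0.5, 2.0]) + 10.0

demo_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("regressor", LinearRegression())]
).fit(demo_X, demo_y)

scaler = demo_pipeline.named_steps["scaler"]
regressor = demo_pipeline.named_steps["regressor"]

# Standardize, then apply the linear model:
# y = ((x - mean) / std) . coef + intercept
manual = ((demo_X - scaler.mean_) / scaler.scale_) @ regressor.coef_ + regressor.intercept_

print(np.allclose(manual, demo_pipeline.predict(demo_X)))
```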


Score the model (RMSE) on the test set:

In [6]:
root_mean_squared_error(y_test, pipeline.predict(X_test))

209298.56486887162
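
RMSE is the square root of the mean squared residual, so it is expressed in the same unit as the target (euros for `valeur_fonciere`). A minimal NumPy sketch of the formula, using a hypothetical helper `rmse` and made-up values:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Residuals are 0, 0 and 30000, so the mean squared residual is 3e8.
print(rmse([100_000, 150_000, 200_000], [100_000, 150_000, 230_000]))
```

Because the residuals are squared before averaging, a few very large errors dominate the score.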

Predict the value of a sample apartment (50 m², 3 rooms, located at Place Victor Hugo in Grenoble):

In [7]:
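# A DataFrame with the training column names avoids scikit-learn's feature-name warning.
pipeline.predict(
    pd.DataFrame([[50, 3, 45.1893525, 5.7216074]], columns=X.columns)
)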

array([200020.95348386])

Export the model to ONNX format using `skl2onnx`:

In [8]:
onnx_model = skl2onnx.to_onnx(
    pipeline,
    X_train[:1].astype(np.float32),
    target_opset=12,
    final_types=[("value", FloatTensorType([None, 1]))],
)
onnx_model_path = Path() / "model.onnx"
onnx_model_path.write_bytes(onnx_model.SerializeToString())
onnx_model.graph

node {
  input: "area"
  input: "rooms"
  input: "latitude"
  input: "longitude"
  output: "concatenated"
  name: "FeatureVectorizer"
  op_type: "FeatureVectorizer"
  attribute {
    name: "inputdimensions"
    ints: 1
    ints: 1
    ints: 1
    ints: 1
    type: INTS
  }
  domain: "ai.onnx.ml"
}
node {
  input: "concatenated"
  output: "variable"
  name: "Scaler"
  op_type: "Scaler"
  attribute {
    name: "offset"
    floats: 59.0145149
    floats: 2.57903218
    floats: 45.1823
    floats: 5.72310352
    type: FLOATS
  }
  attribute {
    name: "scale"
    floats: 0.033674147
    floats: 0.790385902
    floats: 114.907623
    floats: 89.3178787
    type: FLOATS
  }
  domain: "ai.onnx.ml"
}
node {
  input: "variable"
  output: "variable1"
  name: "LinearRegressor"
  op_type: "LinearRegressor"
  attribute {
    name: "coefficients"
    floats: 63193.7617
    floats: -3264.80957
    floats: 49867.2188
    floats: 24691.4941
    type: FLOATS
  }
  attribute {
    name: "intercepts"
   

Load the ONNX model with ONNX Runtime and run inference on the same sample apartment:

In [9]:
session = ort.InferenceSession(onnx_model_path, providers=ort.get_available_providers())
session.run(
    None,
    {
        "area": [[50.0]],
        "rooms": [[3.0]],
        "latitude": [[45.1893525]],
        "longitude": [[5.7216074]],
    },
)

[array([[200039.75]], dtype=float32)]
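
The ONNX result (200039.75) differs from the scikit-learn prediction (200020.95) by about 19, roughly 0.01 %. A plausible cause is precision: the exported graph computes in float32 while scikit-learn works in float64, and the latitude term is multiplied by a scale above 100 and a coefficient near 50000, which magnifies input rounding. The sketch below replays the `Scaler` and `LinearRegressor` nodes in NumPy at both precisions, using the constants printed in the graph above. The intercept is cut off in that output, so it is left out here; it would shift both results equally and does not affect the gap.

```python
import numpy as np

# Constants copied from the Scaler and LinearRegressor nodes of the exported graph.
OFFSET = [59.0145149, 2.57903218, 45.1823, 5.72310352]
SCALE = [0.033674147, 0.790385902, 114.907623, 89.3178787]
COEFFICIENTS = [63193.7617, -3264.80957, 49867.2188, 24691.4941]
SAMPLE = [50.0, 3.0, 45.1893525, 5.7216074]  # area, rooms, latitude, longitude


def linear_term(dtype):
    """Compute ((x - offset) * scale) . coefficients at the given precision."""
    x = np.array(SAMPLE, dtype=dtype)
    scaled = (x - np.array(OFFSET, dtype=dtype)) * np.array(SCALE, dtype=dtype)
    return float(np.dot(scaled, np.array(COEFFICIENTS, dtype=dtype)))


term64 = linear_term(np.float64)
term32 = linear_term(np.float32)
print(term64, term32, term32 - term64)
```

The printed constants are themselves float32 values rounded for display, so this captures only part of the effect; treat the gap it prints as an order-of-magnitude illustration rather than a reproduction of the exact difference.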