# Building and using schemas for structured data
This tutorial uses the Ames Housing dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview


In [1]:
import pandas as pd

Let's start by reading a subset of the dataset. This subset just happens to mostly contain houses on the cheaper end of the price spectrum, which will come in handy later.

See the file `./data_sample/houseprices/data_description.txt` for an explanation of what the features mean. For now, it's just important to know that this dataset contains both numerical and categorical features.

In [2]:
DATA_PATH = "./data_sample/houseprices/subset-cheap.csv"
all_data = pd.read_csv(DATA_PATH).drop("Id", axis="columns")
all_data.head(5)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
1,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,...,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
2,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2008,WD,Abnorml,129900
3,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,...,0,,,,0,1,2008,WD,Normal,118000
4,20,RL,70.0,11200,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,129500


## The cumbersome way
One (cumbersome) way to define a schema for this dataframe is by specifying all features to be tracked manually. This is shown below for some features.

In [3]:
from rdv.schema import Schema
from rdv.extractors.structured import ElementExtractor
from rdv.feature import CategoricFeature, IntFeature, FloatFeature

schema_manual = Schema(
    # A schema needs a name and version
    name="manually-defined", 
    version="0.0.1", 
    # Now, we specify the features we want to track.
    features=[
        IntFeature(name="MSSubClass", extractor=ElementExtractor("MSSubClass")),
        CategoricFeature(name="MSZoning", extractor=ElementExtractor("MSZoning")),
        FloatFeature(name="LotFrontage", extractor=ElementExtractor("LotFrontage")),
    ])
print(schema_manual)
schema_manual.features

Schema(name="manually-defined", version="0.0.1"


{'MSSubClass': IntFeature(name=MSSubClass, extractor=ElementExtractor(element=MSSubClass)),
 'MSZoning': CategoricFeature(name=MSZoning, extractor=ElementExtractor(element=MSZoning)),
 'LotFrontage': FloatFeature(name=LotFrontage, extractor=ElementExtractor(element=LotFrontage))}

## Simplified way
Since we simply want to extract every single feature from a feature vector in this case, the above seems needlessly cumbersome for a dataframe with 80 columns. Instead, you can use the `construct_features` function, as shown below.

If you also want to extract another kind of feature, like a one-hot check or a subvector norm check, you can combine both approaches into one list before passing it to the schema constructor.

In [4]:
from rdv.extractors.structured import construct_features

schema = Schema(
    name="auto-defined", 
    version="0.0.1", 
    features=construct_features(all_data.dtypes))
print(schema)
schema.features

Schema(name="auto-defined", version="0.0.1"


{'MSSubClass': IntFeature(name=MSSubClass, extractor=ElementExtractor(element=MSSubClass)),
 'MSZoning': CategoricFeature(name=MSZoning, extractor=ElementExtractor(element=MSZoning)),
 'LotFrontage': FloatFeature(name=LotFrontage, extractor=ElementExtractor(element=LotFrontage)),
 'LotArea': IntFeature(name=LotArea, extractor=ElementExtractor(element=LotArea)),
 'Street': CategoricFeature(name=Street, extractor=ElementExtractor(element=Street)),
 'Alley': CategoricFeature(name=Alley, extractor=ElementExtractor(element=Alley)),
 'LotShape': CategoricFeature(name=LotShape, extractor=ElementExtractor(element=LotShape)),
 'LandContour': CategoricFeature(name=LandContour, extractor=ElementExtractor(element=LandContour)),
 'Utilities': CategoricFeature(name=Utilities, extractor=ElementExtractor(element=Utilities)),
 'LotConfig': CategoricFeature(name=LotConfig, extractor=ElementExtractor(element=LotConfig)),
 'LandSlope': CategoricFeature(name=LandSlope, extractor=ElementExtractor(element=La
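
To give a feel for what a helper like `construct_features` has to do, here is a minimal sketch of mapping column dtypes to feature kinds. This is not RDV's implementation: the function `infer_feature_kinds` and its rules are assumptions made for illustration, and it returns plain strings rather than RDV feature objects.

```python
def infer_feature_kinds(dtypes):
    """Map each column name to a feature kind based on its dtype name.

    `dtypes` is any mapping-like object with `.items()` yielding
    (column, dtype) pairs. Hypothetical helper, not part of RDV.
    """
    kinds = {}
    for column, dtype in dtypes.items():
        dtype_name = str(dtype)
        if dtype_name.startswith("int"):
            kinds[column] = "IntFeature"
        elif dtype_name.startswith("float"):
            kinds[column] = "FloatFeature"
        else:
            # Anything else (e.g. "object") is treated as categorical here.
            kinds[column] = "CategoricFeature"
    return kinds


# Dtype names as pandas would report them for three of the columns.
dtypes = {"MSSubClass": "int64", "MSZoning": "object", "LotFrontage": "float64"}
kinds = infer_feature_kinds(dtypes)
print(kinds)
```

Because the function only relies on `.items()` and `str(dtype)`, it should also accept a pandas `dtypes` Series such as `all_data.dtypes`.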

## Building the schema

Now that we have defined what the data looks like, we can build the schema. This will build stats for every registered feature.

In [5]:
schema.build(data=all_data)

# Let's also save it for later use.
schema.save("houses-cheap-built.json")

Compiling stats for MSSubClass
Compiling stats for MSZoning
Compiling stats for LotFrontage
Compiling stats for LotArea
Compiling stats for Street
Compiling stats for Alley
Compiling stats for LotShape
Compiling stats for LandContour
Compiling stats for Utilities
Compiling stats for LotConfig
Compiling stats for LandSlope
Compiling stats for Neighborhood
Compiling stats for Condition1
Compiling stats for Condition2
Compiling stats for BldgType
Compiling stats for HouseStyle
Compiling stats for OverallQual
Compiling stats for OverallCond
Compiling stats for YearBuilt
Compiling stats for YearRemodAdd
Compiling stats for RoofStyle
Compiling stats for RoofMatl
Compiling stats for Exterior1st
Compiling stats for Exterior2nd
Compiling stats for MasVnrType
Compiling stats for MasVnrArea
Compiling stats for ExterQual
Compiling stats for ExterCond
Compiling stats for Foundation
Compiling stats for BsmtQual
Compiling stats for BsmtCond
Compiling stats for BsmtExposure
Compiling stats for BsmtFin
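
The tutorial does not show what the built stats contain, so the following is only a sketch of the idea under stated assumptions: numeric features keep the observed `min`, `max` and mean, categorical features keep value frequencies, and everything is serialized to JSON. The helper names and the JSON layout are hypothetical, and the values are the first five `LotFrontage` and `MSZoning` entries from the table printed earlier.

```python
import json
from collections import Counter


def numeric_stats(values):
    """Summarize a numeric column by its observed range and mean."""
    return {"min": min(values), "max": max(values), "mean": sum(values) / len(values)}


def categoric_stats(values):
    """Summarize a categorical column by the relative frequency of each value."""
    counts = Counter(values)
    return {"frequencies": {value: n / len(values) for value, n in counts.items()}}


# First five rows of two columns, copied from the dataframe preview.
lot_frontage = [60.0, 85.0, 51.0, 50.0, 70.0]
ms_zoning = ["RL", "RL", "RM", "RL", "RL"]

built = {
    "LotFrontage": numeric_stats(lot_frontage),
    "MSZoning": categoric_stats(ms_zoning),
}

# Round-trip through JSON, mirroring the save/load step in spirit only.
restored = json.loads(json.dumps(built))
print(restored["LotFrontage"])
print(restored["MSZoning"])
```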

## Inspecting the schema
RDV offers tooling to inspect the schemas that are built. Let's load the schema (just because we can) and inspect it.

In [6]:
schema = Schema.load("houses-cheap-built.json")
schema.view()

Using JupyterDash


Alternatively, we can plot a certain point of interest (poi) on the schema to see how it compares to the training distributions. We can also specify that we want to show the schema in a new browser window.

In [7]:
schema.view(poi=all_data.iloc[2, :], mode="external")

Using JupyterDash
Dash app running on http://127.0.0.1:8050/


## Validating new data
To use the data schema to check incoming data in your production system, simply load it from JSON and call `check(data)`. This will output tags that can be used as metrics in any monitoring platform, but they integrate particularly well with [Raymon.ai](https://raymon.ai).

In [8]:
row = all_data.iloc[-1, :]
tags = schema.check(row)
tags[:10]

[{'type': 'schema-feature',
  'name': 'MSSubClass',
  'value': 60.0,
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-feature',
  'name': 'MSZoning',
  'value': 'RL',
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-feature',
  'name': 'LotFrontage',
  'value': 62.0,
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-feature',
  'name': 'LotArea',
  'value': 7917.0,
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-feature',
  'name': 'Street',
  'value': 'Pave',
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-feature',
  'name': 'Alley',
  'value': nan,
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-error',
  'name': 'Alley-err',
  'value': 'Value NaN',
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-feature',
  'name': 'LotShape',
  'value': 'Reg',
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-feature',
  'name': 'LandContour',
  'value': 'Lvl',
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-feature',
  'name': 'Utilities',
  'value': 'AllPub',
  'gro

There are a few things of note here. 
First of all, all the extracted feature values are returned. This is useful for when you want to track feature distributions on your monitoring backend (which is what happens on the Raymon.ai platform). Also note that these features are not necessarily the ones going into your ML model.

Secondly, the feature `Alley` gives rise to two tags: one being the feature value (`nan`) and one being a schema error indicating that `nan` is not a valid feature value. Raymon will also check whether the data under test lies between the `min` and `max` observed during building. If it does not, an error tag is added for that feature. These error tags can also be sent to your preferred monitoring solution to track the amount of faulty data in your system.

In [9]:
tags[5:7]

[{'type': 'schema-feature',
  'name': 'Alley',
  'value': nan,
  'group': 'auto-defined@0.0.1'},
 {'type': 'schema-error',
  'name': 'Alley-err',
  'value': 'Value NaN',
  'group': 'auto-defined@0.0.1'}]
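
The NaN and range checks described in this section can be sketched as follows. This is an illustration under assumptions, not RDV's code: `check_numeric` is a hypothetical helper, the bounds are made up, and only the `'Value NaN'` message is taken from the output shown here. The out-of-range messages are invented.

```python
import math


def check_numeric(name, value, lo, hi, group):
    """Return a feature tag, plus an error tag if the value is NaN or out of range."""
    tags = [{"type": "schema-feature", "name": name, "value": value, "group": group}]
    error = None
    if isinstance(value, float) and math.isnan(value):
        error = "Value NaN"
    elif value < lo:
        error = "Value below observed min"  # hypothetical message
    elif value > hi:
        error = "Value above observed max"  # hypothetical message
    if error is not None:
        tags.append(
            {"type": "schema-error", "name": f"{name}-err", "value": error, "group": group}
        )
    return tags


GROUP = "auto-defined@0.0.1"
# The bounds 21.0 and 150.0 are placeholders, not values from the built schema.
ok_tags = check_numeric("LotFrontage", 62.0, lo=21.0, hi=150.0, group=GROUP)
high_tags = check_numeric("LotFrontage", 400.0, lo=21.0, hi=150.0, group=GROUP)
nan_tags = check_numeric("LotFrontage", float("nan"), lo=21.0, hi=150.0, group=GROUP)

print(len(ok_tags), len(high_tags), len(nan_tags))
```

A valid value yields a single feature tag, while a NaN or out-of-range value yields the feature tag followed by an error tag.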

The dictionary-style tags shown so far are structured for easy integration with the Raymon.ai platform. You can also have `check` return the tags as regular objects, which you can convert to whatever form your monitoring solution needs.

In [10]:
tags = schema.check(row, convert_json=False)
tags[:10]

[Tag(name='MSSubClass, value=60.0, type=schema-feature, group=auto-defined@0.0.1,
 Tag(name='MSZoning, value=RL, type=schema-feature, group=auto-defined@0.0.1,
 Tag(name='LotFrontage, value=62.0, type=schema-feature, group=auto-defined@0.0.1,
 Tag(name='LotArea, value=7917.0, type=schema-feature, group=auto-defined@0.0.1,
 Tag(name='Street, value=Pave, type=schema-feature, group=auto-defined@0.0.1,
 Tag(name='Alley, value=nan, type=schema-feature, group=auto-defined@0.0.1,
 Tag(name='Alley-err, value=Value NaN, type=schema-error, group=auto-defined@0.0.1,
 Tag(name='LotShape, value=Reg, type=schema-feature, group=auto-defined@0.0.1,
 Tag(name='LandContour, value=Lvl, type=schema-feature, group=auto-defined@0.0.1,
 Tag(name='Utilities, value=AllPub, type=schema-feature, group=auto-defined@0.0.1]
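
As one example of turning tags into a monitoring metric, the sketch below counts feature and error tags. It works on the dictionary form of the tags, using three entries copied from the output earlier in this section. `summarize_tags` is a hypothetical helper, not part of RDV.

```python
import math
from collections import Counter

# Three tags in the dictionary form returned by the default `check` call.
tags = [
    {"type": "schema-feature", "name": "Street", "value": "Pave", "group": "auto-defined@0.0.1"},
    {"type": "schema-feature", "name": "Alley", "value": math.nan, "group": "auto-defined@0.0.1"},
    {"type": "schema-error", "name": "Alley-err", "value": "Value NaN", "group": "auto-defined@0.0.1"},
]


def summarize_tags(tags):
    """Count tags per type and list the names of all error tags."""
    counts = Counter(tag["type"] for tag in tags)
    return {
        "n_features": counts["schema-feature"],
        "n_errors": counts["schema-error"],
        "error_names": [tag["name"] for tag in tags if tag["type"] == "schema-error"],
    }


summary = summarize_tags(tags)
print(summary)
```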

## Comparing schemas
Checking for invalid feature values only says so much. Comparing distributions tells more. This is exactly what can be done with the `schema.compare(other)` method, as illustrated below. This function will perform a statistical test on every feature to check whether they have the same distribution.

In [11]:

exp_data = pd.read_csv("./data_sample/houseprices/subset-exp.csv").drop("Id", axis="columns")
schema_exp = Schema(
    name="expensive", 
    version="0.0.1", 
    features=construct_features(all_data.dtypes))
    
schema_exp.build(exp_data)
schema.compare(schema_exp)

Compiling stats for MSSubClass
Compiling stats for MSZoning
Compiling stats for LotFrontage
Compiling stats for LotArea
Compiling stats for Street
Compiling stats for Alley
Compiling stats for LotShape
Compiling stats for LandContour
Compiling stats for Utilities
Compiling stats for LotConfig
Compiling stats for LandSlope
Compiling stats for Neighborhood
Compiling stats for Condition1
Compiling stats for Condition2
Compiling stats for BldgType
Compiling stats for HouseStyle
Compiling stats for OverallQual
Compiling stats for OverallCond
Compiling stats for YearBuilt
Compiling stats for YearRemodAdd
Compiling stats for RoofStyle
Compiling stats for RoofMatl
Compiling stats for Exterior1st
Compiling stats for Exterior2nd
Compiling stats for MasVnrType
Compiling stats for MasVnrArea
Compiling stats for ExterQual
Compiling stats for ExterCond
Compiling stats for Foundation
Compiling stats for BsmtQual
Compiling stats for BsmtCond
Compiling stats for BsmtExposure
Compiling stats for BsmtFin

As we can see, most features have a different distribution in the two schemas. This is as expected: one is built on houses at the cheap end of the price spectrum, the other on houses at the expensive end. Detecting such distribution shifts is important for maintaining reliable ML systems.

Note: comparing schemas like this is exactly what we do on the Raymon.ai backend.
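
The tutorial does not say which statistical test `compare` runs. As one common choice for numeric features, the two-sample Kolmogorov-Smirnov statistic measures the largest gap between two empirical cumulative distribution functions. The sketch below is a plain-Python illustration under that assumption: `ks_statistic` is a hypothetical helper and the number lists are toy values, not taken from the dataset.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed values,
    # so it is enough to evaluate the gap at every observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Toy samples: fully separated, partially overlapping, and identical.
separated = ks_statistic([100, 110, 120, 130], [300, 310, 320, 330])
overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
identical = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])

print(separated)    # 1.0
print(overlapping)  # 0.5
print(identical)    # 0.0
```

A statistic near 1 suggests the two samples come from very different distributions, while a value near 0 suggests they are similar. A real test would also turn the statistic into a p-value that accounts for sample size.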

As a sanity check, we can sample the same dataframe twice and see whether there are distribution changes detected.

In [12]:
dfs1 = exp_data.sample(frac=0.6)
dfs2 = exp_data.sample(frac=0.6)

s1schema = Schema(
    name="s1", 
    features=construct_features(all_data.dtypes))

s2schema = Schema(
    name="s2", 
    features=construct_features(all_data.dtypes))
    
s1schema.build(dfs1)
s2schema.build(dfs2)


s1schema.compare(s2schema)

Compiling stats for MSSubClass
Compiling stats for MSZoning
Compiling stats for LotFrontage
Compiling stats for LotArea
Compiling stats for Street
Compiling stats for Alley
Compiling stats for LotShape
Compiling stats for LandContour
Compiling stats for Utilities
Compiling stats for LotConfig
Compiling stats for LandSlope
Compiling stats for Neighborhood
Compiling stats for Condition1
Compiling stats for Condition2
Compiling stats for BldgType
Compiling stats for HouseStyle
Compiling stats for OverallQual
Compiling stats for OverallCond
Compiling stats for YearBuilt
Compiling stats for YearRemodAdd
Compiling stats for RoofStyle
Compiling stats for RoofMatl
Compiling stats for Exterior1st
Compiling stats for Exterior2nd
Compiling stats for MasVnrType
Compiling stats for MasVnrArea
Compiling stats for ExterQual
Compiling stats for ExterCond
Compiling stats for Foundation
Compiling stats for BsmtQual
Compiling stats for BsmtCond
Compiling stats for BsmtExposure
Compiling stats for BsmtFin