# Building and using Profiles for structured data
This tutorial uses the Ames Housing dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

Note that some outputs may not render when viewing this notebook on GitHub, since they are shown in an iframe. We recommend cloning this repo and executing the notebooks locally.

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd


Let's start by reading a subset of the dataset. This subset just happens to mostly contain houses on the cheaper end of the price spectrum, which will come in handy later.

See the file `./data_sample/houseprices/data_description.txt` for an explanation of what the features mean. For now, it's just important to know that this dataset contains both numerical and categorical features.

In [2]:
DATA_PATH = "./data_sample/houseprices/subset-cheap.csv"
all_data = pd.read_csv(DATA_PATH).drop("Id", axis="columns")
all_data.head(5)

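| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 1 | 50 | RL | 85.0 | 14115 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
| 2 | 50 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2008 | WD | Abnorml | 129900 |
| 3 | 190 | RL | 50.0 | 7420 | Pave | NaN | Reg | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Normal | 118000 |
| 4 | 20 | RL | 70.0 | 11200 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 129500 |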


## The cumbersome way
One (cumbersome) way to define a Profile for this dataframe is by specifying all features to be tracked manually. This is shown below for some features.

In [3]:
from raymon import ModelProfile
from raymon.profiling.extractors.structured import ElementExtractor
from raymon import InputComponent

profile_manual = ModelProfile(
    # A profile needs a name and a version
    name="manually-defined", 
    version="0.0.1", 
    # Now, we specify the features we want to track.
    components=[
        InputComponent(name="MSSubClass", extractor=ElementExtractor("MSSubClass")),
        InputComponent(name="MSZoning", extractor=ElementExtractor("MSZoning")),
        InputComponent(name="LotFrontage", extractor=ElementExtractor("LotFrontage")),
    ])
print(profile_manual)
profile_manual.components

ModelProfile(name="manually-defined", version="0.0.1"


{'mssubclass': InputComponent(name=mssubclass, dtype=FLOAT, extractor=ElementExtractor(element=MSSubClass)),
 'mszoning': InputComponent(name=mszoning, dtype=FLOAT, extractor=ElementExtractor(element=MSZoning)),
 'lotfrontage': InputComponent(name=lotfrontage, dtype=FLOAT, extractor=ElementExtractor(element=LotFrontage))}
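
To make the idea of an element extractor concrete, here is a minimal sketch of what extracting a single named field from a row amounts to. `KeyExtractor` is a hypothetical stand-in written for this illustration, not Raymon's `ElementExtractor`, and the row is a plain dict holding the first three values of row 0 above.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class KeyExtractor:
    """Hypothetical stand-in: picks one named field out of a row."""

    element: str

    def extract(self, row: Mapping[str, Any]) -> Any:
        # A pandas Series and a plain dict both support row[key].
        return row[self.element]


# First three features of row 0 in the table above.
row = {"MSSubClass": 70, "MSZoning": "RL", "LotFrontage": 60.0}

# One extractor per feature, keyed by the lowercased feature name.
extractors = {name.lower(): KeyExtractor(name) for name in row}
features = {name: ex.extract(row) for name, ex in extractors.items()}
print(features)
```

Writing one such component per column by hand is what makes this approach tedious for wide dataframes.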

## Simplified way
Since we simply want to extract every single feature from a feature vector in this case, the above is needlessly cumbersome for a dataframe with 80 columns. Instead, you can use the `generate_components` function, as shown below.

If you also want to extract another kind of feature, like a one-hot check or a subvector norm check, you can combine both approaches into one list before passing it to the `ModelProfile` constructor.

In [4]:
from raymon.profiling.extractors.structured import generate_components

profile = ModelProfile(
    name="houses_cheap", 
    version="0.0.1", 
    components=generate_components(all_data.dtypes),
    )
print(profile)
profile.components

ModelProfile(name="houses_cheap", version="0.0.1"


{'mssubclass': InputComponent(name=mssubclass, dtype=INT, extractor=ElementExtractor(element=MSSubClass)),
 'mszoning': InputComponent(name=mszoning, dtype=CAT, extractor=ElementExtractor(element=MSZoning)),
 'lotfrontage': InputComponent(name=lotfrontage, dtype=FLOAT, extractor=ElementExtractor(element=LotFrontage)),
 'lotarea': InputComponent(name=lotarea, dtype=INT, extractor=ElementExtractor(element=LotArea)),
 'street': InputComponent(name=street, dtype=CAT, extractor=ElementExtractor(element=Street)),
 'alley': InputComponent(name=alley, dtype=CAT, extractor=ElementExtractor(element=Alley)),
 'lotshape': InputComponent(name=lotshape, dtype=CAT, extractor=ElementExtractor(element=LotShape)),
 'landcontour': InputComponent(name=landcontour, dtype=CAT, extractor=ElementExtractor(element=LandContour)),
 'utilities': InputComponent(name=utilities, dtype=CAT, extractor=ElementExtractor(element=Utilities)),
 'lotconfig': InputComponent(name=lotconfig, dtype=CAT, extractor=ElementExtract
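
The output above suggests that each column's dtype decides whether its component is treated as `INT`, `FLOAT` or `CAT`. The sketch below illustrates that kind of mapping under simple assumptions: `infer_kinds` is a hypothetical helper, not part of Raymon, and dtypes are passed as plain strings so the example runs without pandas.

```python
def infer_kinds(dtypes):
    """Map {column: dtype} to {lowercased column: 'INT' | 'FLOAT' | 'CAT'}.

    Hypothetical helper for illustration; anything that is not an integer
    or float dtype is treated as categorical.
    """
    kinds = {}
    for column, dtype in dtypes.items():
        dtype = str(dtype)
        if dtype.startswith(("int", "uint")):
            kind = "INT"
        elif dtype.startswith("float"):
            kind = "FLOAT"
        else:
            kind = "CAT"
        kinds[column.lower()] = kind
    return kinds


# Assumed dtypes for the first four columns of the dataframe.
column_dtypes = {
    "MSSubClass": "int64",
    "MSZoning": "object",
    "LotFrontage": "float64",
    "LotArea": "int64",
}
kinds = infer_kinds(column_dtypes)
print(kinds)
```

With pandas available, something like `infer_kinds(all_data.dtypes.to_dict())` should work as well, since the helper converts each dtype to a string first.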

## Building the Profile

Now that we have defined what the data looks like, we can build the profile. This will build stats for all registered components.


In [5]:
profile.build(input=all_data, silent=False)

# Let's also save it for later use.
profile.save(".")

mssubclass
mszoning
lotfrontage
lotarea
street
alley
lotshape
landcontour
utilities
lotconfig
landslope
neighborhood
condition1
condition2
bldgtype
housestyle
overallqual
overallcond
yearbuilt
yearremodadd
roofstyle
roofmatl
exterior1st
exterior2nd
masvnrtype
masvnrarea
exterqual
extercond
foundation
bsmtqual
bsmtcond
bsmtexposure
bsmtfintype1
bsmtfinsf1
bsmtfintype2
bsmtfinsf2
bsmtunfsf
totalbsmtsf
heating
heatingqc
centralair
electrical
1stflrsf
2ndflrsf
lowqualfinsf
grlivarea
bsmtfullbath
bsmthalfbath
fullbath
halfbath
bedroomabvgr
kitchenabvgr
kitchenqual
totrmsabvgrd
functional
fireplaces
fireplacequ
garagetype
garageyrblt
garagefinish
garagecars
garagearea
garagequal
garagecond
paveddrive
wooddecksf
openporchsf
enclosedporch
3ssnporch
screenporch
poolarea
poolqc
fence
miscfeature
miscval
mosold
yrsold
saletype
salecondition
saleprice
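
Conceptually, building a profile means summarizing each component's values. The sketch below shows what such summaries could look like for one numeric and one categorical feature, using the five rows displayed earlier. The function names and the exact statistics are assumptions made for illustration; the stats Raymon stores may differ.

```python
import math
from collections import Counter


def is_missing(value):
    """True for None and float NaN, the usual missing-value markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def numeric_stats(values):
    """Summary of a numeric feature; assumes at least one non-missing value."""
    present = [v for v in values if not is_missing(v)]
    return {
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
        "missing_fraction": 1 - len(present) / len(values),
    }


def categorical_stats(values):
    """Relative frequency of each category; assumes at least one non-missing value."""
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    return {
        "frequencies": {cat: n / len(present) for cat, n in counts.items()},
        "missing_fraction": 1 - len(present) / len(values),
    }


# LotFrontage and MSZoning for the five rows shown in the table above.
lot_frontage = [60.0, 85.0, 51.0, 50.0, 70.0]
ms_zoning = ["RL", "RL", "RM", "RL", "RL"]

lot_stats = numeric_stats(lot_frontage)
zoning_stats = categorical_stats(ms_zoning)
print(lot_stats)
print(zoning_stats)
```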


## Inspecting the schema
Raymon offers tooling to inspect the profiles that are built. Let's load the profile we just saved (just because we can) and inspect it.

In [6]:
from IPython.display import IFrame
from pathlib import Path
profile = ModelProfile.load("houses_cheap@0.0.1.json")
profile.view(mode='external', outdir=Path(".").absolute())

PosixPath('/Users/kv/Raymon/Code/raymon/examples/.tmpgno7c484/schema.html')

Alternatively, we can plot a certain point of interest (poi) on the profile to see how it compares to the training distributions. We can also specify that we want to show the profile in a new browser window.

In [7]:
profile.view(poi=all_data.iloc[2, :], mode="external")

PosixPath('/var/folders/cn/ht7pqf_j1hg6l7b552dnfvrw0000gn/T/.tmp5kyk07w7/schema.html')

## Validating new data
To use the data profile to check incoming data in your production system, simply load it from JSON and call `validate_input(data)`. This will output tags that can be used as metrics in any monitoring platform, but they integrate particularly well with [Raymon.ai](https://raymon.ai).

In [8]:
row = all_data.iloc[-1, :]
tags = profile.validate_input(row)
tags[:10]

[{'type': 'profile-input',
  'name': 'mssubclass',
  'value': 60,
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input',
  'name': 'mszoning',
  'value': 'RL',
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input',
  'name': 'lotfrontage',
  'value': 62.0,
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input',
  'name': 'lotarea',
  'value': 7917,
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input',
  'name': 'street',
  'value': 'Pave',
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input-error',
  'name': 'alley-error',
  'value': 'Value NaN',
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input',
  'name': 'lotshape',
  'value': 'Reg',
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input',
  'name': 'landcontour',
  'value': 'Lvl',
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input',
  'name': 'utilities',
  'value': 'AllPub',
  'group': 'houses_cheap@0.0.1'},
 {'type': 'profile-input',
  'name': 'lotconfig',
  'value': 'Inside',
 

There are a few things of note here. 
First of all, all the extracted feature values are returned. This is useful for when you want to track feature distributions on your monitoring backend (which is what happens on the Raymon.ai platform). Also note that these features are not necessarily the ones going into your ML model.

Secondly, the feature `Alley` gives rise to a profile error, indicating that `nan` is not a valid feature value. Raymon will also check whether the data under test lies between the `min` and `max` values observed while building the profile. If it does not, an error tag is added for that feature. These error tags can also be sent to your preferred monitoring solution to track the amount of faulty data in your system.
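
The checks described above can be sketched as follows for a single numeric feature. This is an illustration under stated assumptions, not Raymon's implementation: `validate_numeric` is a hypothetical function, the out-of-range messages are made up, and the sketch returns only the error tag when a check fails.

```python
import math


def validate_numeric(name, value, stats, group):
    """Return one tag dict shaped like the validation output shown above."""

    def error(message):
        return {
            "type": "profile-input-error",
            "name": f"{name}-error",
            "value": message,
            "group": group,
        }

    if value is None or (isinstance(value, float) and math.isnan(value)):
        return error("Value NaN")
    if value < stats["min"]:
        return error("Value below observed min")  # hypothetical message
    if value > stats["max"]:
        return error("Value above observed max")  # hypothetical message
    return {"type": "profile-input", "name": name, "value": value, "group": group}


# Assumed min/max as they might have been observed while building.
stats = {"min": 50.0, "max": 85.0}
group = "houses_cheap@0.0.1"

ok_tag = validate_numeric("lotfrontage", 62.0, stats, group)
nan_tag = validate_numeric("lotfrontage", float("nan"), stats, group)
high_tag = validate_numeric("lotfrontage", 120.0, stats, group)
print(ok_tag, nan_tag, high_tag, sep="\n")
```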

In [9]:
[t for t in tags if 'alley' in t["name"]]

[{'type': 'profile-input-error',
  'name': 'alley-error',
  'value': 'Value NaN',
  'group': 'houses_cheap@0.0.1'}]

In [10]:
all_data.iloc[-1]["Alley"]

nan

The output above is returned as a list of dicts. You can also return the tags as `Tag` objects.

In [11]:
tags = profile.validate_input(row, convert_json=False)
tags[:10]

[Tag(name='mssubclass, value=60, type=profile-input, group=houses_cheap@0.0.1,
 Tag(name='mszoning, value=RL, type=profile-input, group=houses_cheap@0.0.1,
 Tag(name='lotfrontage, value=62.0, type=profile-input, group=houses_cheap@0.0.1,
 Tag(name='lotarea, value=7917, type=profile-input, group=houses_cheap@0.0.1,
 Tag(name='street, value=Pave, type=profile-input, group=houses_cheap@0.0.1,
 Tag(name='alley-error, value=Value NaN, type=profile-input-error, group=houses_cheap@0.0.1,
 Tag(name='lotshape, value=Reg, type=profile-input, group=houses_cheap@0.0.1,
 Tag(name='landcontour, value=Lvl, type=profile-input, group=houses_cheap@0.0.1,
 Tag(name='utilities, value=AllPub, type=profile-input, group=houses_cheap@0.0.1,
 Tag(name='lotconfig, value=Inside, type=profile-input, group=houses_cheap@0.0.1]
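
To track the amount of faulty data, the tags in dict form can be reduced to a single number per validated row. The sketch below counts the share of error tags; `error_rate` is a hypothetical helper, and the assumption that error tags are recognizable by a type ending in `-error` is based only on the output shown above.

```python
def error_rate(tags):
    """Fraction of tags whose type marks them as a profile error."""
    if not tags:
        return 0.0
    errors = sum(1 for tag in tags if tag["type"].endswith("-error"))
    return errors / len(tags)


# A few tags copied from the validation output above.
example_tags = [
    {"type": "profile-input", "name": "mssubclass", "value": 60, "group": "houses_cheap@0.0.1"},
    {"type": "profile-input", "name": "street", "value": "Pave", "group": "houses_cheap@0.0.1"},
    {"type": "profile-input-error", "name": "alley-error", "value": "Value NaN", "group": "houses_cheap@0.0.1"},
    {"type": "profile-input", "name": "lotshape", "value": "Reg", "group": "houses_cheap@0.0.1"},
]

rate = error_rate(example_tags)
print(f"{rate:.0%} of the tags are errors")
```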

## Comparing profiles
Testing for invalid feature values only tells you so much. Comparing distributions tells you more. This is exactly what the `profile.contrast` method does, as illustrated below. It tests, component by component, whether the two profiles have the same distribution.

In [12]:
import json
exp_data = pd.read_csv("./data_sample/houseprices/subset-exp.csv").drop("Id", axis="columns")
profile_exp = ModelProfile(
    name="houses_exp", 
    version="0.0.1", 
    components=generate_components(exp_data.dtypes),
    )
profile_exp.build(input=exp_data)

profile.view_contrast(profile_exp, mode="external")


PosixPath('/var/folders/cn/ht7pqf_j1hg6l7b552dnfvrw0000gn/T/.tmpttkapehw/schema.html')

In [13]:
contrast_report = profile.contrast(profile_exp)

with open('contrast.json', 'w') as f:
    json.dump(contrast_report, f, indent=4)

As we can see, most features have a different distribution in these two profiles. This is as expected: one is built on houses at the cheap end of the price spectrum, the other on houses at the expensive end. Finding out about such distribution shifts is important for maintaining reliable ML systems.

Note: comparing profiles like this is exactly what we do on the Raymon.ai backend.
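
One common way to quantify how far apart two numeric samples are is the two-sample Kolmogorov-Smirnov statistic: the largest gap between their empirical cumulative distribution functions. Whether `contrast` uses this particular test is not stated here, so treat the sketch below as an illustration of the idea only. The cheap prices come from the table at the top; the expensive prices are made-up values.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs (0 = same, 1 = disjoint)."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    # The gap can only change at observed values, so checking those is enough.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


cheap_prices = [140000, 143000, 129900, 118000, 129500]
expensive_prices = [310000, 285000, 402000, 350000, 299000]  # hypothetical values

shifted = ks_statistic(cheap_prices, expensive_prices)
identical = ks_statistic(cheap_prices, cheap_prices)
print(shifted, identical)
```

A statistic close to 1 signals a strong shift, while a value close to 0 means the two samples look alike.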

As a sanity check, we can sample the same dataframe twice and see whether there are distribution changes detected.

In [14]:
dfs1 = exp_data.sample(frac=0.6)
dfs2 = exp_data.sample(frac=0.6)

s1schema = ModelProfile(
    name="s1", 
    components=generate_components(all_data.dtypes)
    )

s2schema = ModelProfile(
    name="s2", 
    components=generate_components(all_data.dtypes)
    )
    
s1schema.build(dfs1)
s2schema.build(dfs2)

s1schema.view_contrast(s2schema, mode="external")



PosixPath('/var/folders/cn/ht7pqf_j1hg6l7b552dnfvrw0000gn/T/.tmpre535hja/schema.html')