# Guided Exercise: Drift

### Setup
You are the principal data scientist at a new startup that recommends prices for rental home listings. Your beach-head market was San Francisco, and this is where you trained the model that forms the core service of the business. Now the startup is looking to expand into Seattle and Austin. Using the mean price difference between San Francisco and each new city, you want to make sure your price recommendations don't drift. If they drift too low, your customers will leave money on the table; if they drift too high, their listings will sit vacant. Hitting the Goldilocks zone is critical for acquiring and keeping happy customers in Seattle and Austin.

Competitors in Seattle are within $65 of the ideal price, and due to stiffer competition, competitors in Austin are within $40 of the ideal price. These are the benchmarks we need to hit to prove a viable product.

#### Goals 🎯
In this tutorial, you will learn how to:
1. Set up and view the results of stability tests.
2. Debug the true cause of stability issues.
3. Retest the new model and confirm the effectiveness of the mitigation strategy.

### First, set the credentials for your TruEra deployment.
If you don't have credentials yet, get them instantly by signing up for free at: https://www.truera.com

In [1]:
# connection details
TRUERA_URL = "https://app.truera.net/"
AUTH_TOKEN = "..."



### Install and import required packages for running in Colab.

In [None]:
! pip install --upgrade truera

In [3]:
import logging
import pandas as pd
import sklearn.metrics
import xgboost as xgb
from sklearn import preprocessing
from sklearn.utils import resample

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication
from truera.client.ingestion import ModelOutputContext, ColumnSpec

auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

INFO:truera.client.remote_truera_workspace:Connecting to 'http://localhost:8000'


### From here, run the rest of the notebook and follow the analysis.
### First, load the data for San Francisco, your beach-head market, where you will train the model. Also load data for Seattle and Austin, your target markets.

In [4]:
# load data
san_francisco = pd.read_csv("https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/San_Francisco_for_stability.csv")
seattle = pd.read_csv("https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/Seattle_for_stability.csv")
austin = pd.read_csv("https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/Austin_for_stability.csv")

# make all float and make index ids
san_francisco = san_francisco.astype(float).reset_index(names="id")
seattle = seattle.astype(float).reset_index(names="id")
austin = austin.astype(float).reset_index(names="id")

### Create the project and set defaults for faster ingestion

In [5]:
# create the first project and data collection
project_name = "Starter Example Companion - Drift"
tru.add_project(project_name, score_type="regression")
tru.set_influence_type("shap")
tru.add_data_collection("Data Collection v1")

# reduce settings for speed
tru.set_num_internal_qii_samples(100)
tru.set_num_default_influences(100)

### Next, add the data.

To do so, we'll use the add_data method along with ColumnSpec. add_data is a general-purpose method for adding data to TruEra and can ingest a variety of data types, including feature data, predictions, influences, labels, and extra data.

ColumnSpec is a helper class, imported from truera.client.ingestion, that specifies the purpose of each column in the dataframe. If you prefer, you can skip ColumnSpec and pass a dict instead, e.g. {"id_col_name": "id",...}
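As a minimal sketch of the dict form, assuming the dict keys mirror the ColumnSpec keyword arguments used in this notebook (the inline example only confirms id_col_name), a spec could be assembled as below. The helper name build_column_spec and the sample column names are hypothetical and are not part of the TruEra client.

```python
# Hypothetical helper: builds a plain-dict column spec from a list of columns.
# Assumption: dict keys match the ColumnSpec keyword names used in this notebook.
def build_column_spec(columns, id_col="id", label_col="price"):
    # Every column that is neither the id nor the label is treated as a feature.
    feature_cols = [c for c in columns if c not in (id_col, label_col)]
    return {
        "id_col_name": id_col,
        "pre_data_col_names": feature_cols,
        "label_col_names": label_col,
    }


# Made-up column names for illustration only.
example_columns = ["id", "bedrooms", "bathrooms", "price"]
spec = build_column_spec(example_columns)
print(spec)
```

Deriving the feature list by exclusion keeps the spec in sync with the dataframe it describes, which matters when the same pattern is repeated for several splits.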

In [6]:
# add data to the collection we just created
tru.add_data(
    data=san_francisco,
    data_split_name="San Francisco",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=list(san_francisco.columns.drop(["id", "price"])),
        label_col_names="price")
)
tru.add_data(
    data=seattle,
    data_split_name="Seattle",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=list(seattle.columns.drop(["id", "price"])),
        label_col_names="price")
)
tru.add_data(
    data=austin,
    data_split_name="Austin",
    column_spec=ColumnSpec(
        id_col_name="id",
        pre_data_col_names=list(austin.columns.drop(["id", "price"])),
        label_col_names="price")
)

Uploading tmpma_bpaco.parquet (224.4KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmpvuahmkmh.parquet (147.9KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmpede2pf5r.parquet (336.8KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


### Train the model and register it in TruEra

In [7]:
# train first model
xgb_reg = xgb.XGBRegressor(eta=0.2, max_depth=4)
xgb_reg.fit(san_francisco.drop(["price", "id"], axis=1), san_francisco.price)

# register the model
tru.add_python_model(
    "model_1",
    xgb_reg,
    train_split_name="San Francisco",
    train_parameters={"model_type": "xgb.XGBRegressor", "eta": 0.2, "max_depth": 4},
    compute_predictions=False
)

INFO:truera.client.remote_truera_workspace:Uploading xgboost model: XGBRegressor
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!


Uploading MLmodel (218.0B) -- ### -- file upload complete.
Uploading tmpwbw3o8b4.json (171.8KiB) -- ### -- file upload complete.
Uploading conda.yaml (208.0B) -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.py (459.0B) -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.cpython-310.pyc (1.1KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "model_1" is added and associated with remote data collection "Data Collection v1". "model_1" is set as the model for the workspace context.


Model uploaded to: http://localhost:8000/home/p/Starter%20Example%20Companion%20-%20Drift/m/model_1/


### Add predictions

We'll use the built-in tru.get_ys_pred() to compute predictions.

Once we've computed the predictions, we can add them using add_data again. We'll use ColumnSpec here too, this time supplying prediction_col_names.

In [8]:
# predictions
tru.set_influences_background_data_split("San Francisco")
tru.set_data_split("San Francisco")
sf_preds = tru.get_ys_pred().reset_index(names="id")
tru.set_data_split("Seattle")
se_preds = tru.get_ys_pred().reset_index(names="id")
tru.set_data_split("Austin")
au_preds = tru.get_ys_pred().reset_index(names="id")

tru.add_data(
    data=sf_preds,
    data_split_name="San Francisco",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="__truera_prediction__"
    )
)

tru.add_data(
    data=se_preds,
    data_split_name="Seattle",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="__truera_prediction__"
    )
)

tru.add_data(
    data=au_preds,
    data_split_name="Austin",
    column_spec=ColumnSpec(
        id_col_name="id",
        prediction_col_names="__truera_prediction__"
    )
)

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpbvnidxr7
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". 
INFO:truera.client.truera_workspace:Syncing data split "San Francisco" to local.
INFO:truera.client.local.local_truera_workspace:Data split "San Francisco" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing model model_1 to local.
INFO:truera.client.local.local_truera_workspace:Model "model_1" is added and associated with local data collection "Data Collection v1". "model_1" is set as the model for the workspace context.
ERROR:truera.client.truera_workspace:Failed to sync explanation cache for model "model_1" and data_split "San Francisco" to local: Requested inf

Uploading tmpv4ab78tt.parquet (70.7KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:`model_output_context` will be inferred as it was not provided.
INFO:truera.client.remote_truera_workspace:Inferred ModelOutputContext: ModelOutputContext(model_name='model_1', score_type='regression', background_split_name='', influence_type='')


Uploading tmpk05uraqb.parquet (44.4KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:`model_output_context` will be inferred as it was not provided.
INFO:truera.client.remote_truera_workspace:Inferred ModelOutputContext: ModelOutputContext(model_name='model_1', score_type='regression', background_split_name='', influence_type='')


Uploading tmpyjvvsgm9.parquet (111.2KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


### Computing and Ingesting Feature Influences

Feature influences are a core component of the TruEra platform that enables model explainability.

To compute them, we'll use get_feature_influences. We'll use open-source SHAP as our influence calculation method here.

Note: TruEra-QII is available for paid users to provide faster and more accurate computation.

It's also a good time to introduce the Model Output Context. You should think of this as the metadata related to computing model outputs, including feature influences.
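To make the idea of an influence measured against a background split concrete, here is a toy sketch that does not use TruEra or the SHAP library. It assumes a hand-written linear model with made-up weights and features; for such a model, with features treated independently, the Shapley value of a feature reduces to its weight times the feature's deviation from the background mean.

```python
from statistics import mean

# Toy linear "price model": price = BIAS + sum(weight * feature value).
# Weights, bias, and feature names are made up for illustration.
WEIGHTS = {"bedrooms": 50.0, "bathrooms": 30.0}
BIAS = 40.0


def predict(row):
    return BIAS + sum(WEIGHTS[name] * row[name] for name in WEIGHTS)


# These rows play the role of the background split.
background = [
    {"bedrooms": 1.0, "bathrooms": 1.0},
    {"bedrooms": 2.0, "bathrooms": 1.0},
    {"bedrooms": 3.0, "bathrooms": 2.0},
]


def linear_influences(row, background_rows):
    """Per-feature influence: weight * (value - background mean of that feature)."""
    return {
        name: WEIGHTS[name] * (row[name] - mean(r[name] for r in background_rows))
        for name in WEIGHTS
    }


listing = {"bedrooms": 4.0, "bathrooms": 2.0}
influences = linear_influences(listing, background)
baseline = mean(predict(r) for r in background)

# The influences add up to the gap between this prediction and the
# average prediction over the background rows.
print(influences)
print(predict(listing) - baseline, sum(influences.values()))
```

The background rows set the baseline that every influence is measured against, which is why the Model Output Context records the background split alongside the influence type.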

In [9]:
# feature influences

tru.set_influence_type("shap")

# reduce settings for speed
tru.set_num_internal_qii_samples(100)
tru.set_num_default_influences(100)

se_explainer = tru.get_explainer("Seattle")
se_infs = se_explainer.get_feature_influences().reset_index(names="id")

sf_explainer = tru.get_explainer("San Francisco")
sf_infs = sf_explainer.get_feature_influences().reset_index(names="id")

au_explainer = tru.get_explainer("Austin")
au_infs = au_explainer.get_feature_influences().reset_index(names="id")

model_output_context = ModelOutputContext(
    model_name="model_1",
    score_type="regression",
    background_split_name="San Francisco",
    influence_type="kernel-shap")

tru.add_data(
    data=sf_infs,
    data_split_name="San Francisco",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=list(sf_infs.columns.drop("id"))
    ),
    model_output_context=model_output_context
)

tru.add_data(
    data=se_infs,
    data_split_name="Seattle",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=list(se_infs.columns.drop("id"))
    ),
    model_output_context=model_output_context
)

tru.add_data(
    data=au_infs,
    data_split_name="Austin",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=list(au_infs.columns.drop("id"))
    ),
    model_output_context=model_output_context
)

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpbvnidxr7
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpbvnidxr7
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpbvnidxr7
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.


Uploading tmpsx7pswgf.parquet (37.9KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmpak5k2z47.parquet (37.6KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmp5utlnks9.parquet (37.9KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


### Computing and Ingesting Error Influences

Error influences attribute the model's error, rather than its output, to each feature. This helps pinpoint which features drive inaccurate predictions.

To compute them, we'll again use get_feature_influences, except with a different score type: "mean_absolute_error_for_regression".
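The score type name refers to mean absolute error. As a reminder of the quantity being attributed, here is a minimal sketch with made-up prices; it shows only the error metric itself, not how TruEra distributes that error across features.

```python
def absolute_errors(y_true, y_pred):
    # Per-listing absolute gap between the actual and the recommended price.
    return [abs(t - p) for t, p in zip(y_true, y_pred)]


def mean_absolute_error(y_true, y_pred):
    errors = absolute_errors(y_true, y_pred)
    return sum(errors) / len(errors)


# Made-up prices for illustration only.
actual_prices = [200.0, 150.0, 300.0]
recommended_prices = [180.0, 165.0, 310.0]

per_listing = absolute_errors(actual_prices, recommended_prices)
mae = mean_absolute_error(actual_prices, recommended_prices)
print(per_listing, mae)
```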

In [10]:
# error influences
model_output_context = ModelOutputContext(
    model_name="model_1",
    score_type="mean_absolute_error_for_regression",
    background_split_name="San Francisco",
    influence_type="kernel-shap")

tru.set_data_split("San Francisco")
sf_error_infs = tru.get_feature_influences(score_type="mean_absolute_error_for_regression").reset_index(names="id")

tru.set_data_split("Seattle")
se_error_infs = tru.get_feature_influences(score_type="mean_absolute_error_for_regression").reset_index(names="id")

tru.set_data_split("Austin")
au_error_infs = tru.get_feature_influences(score_type="mean_absolute_error_for_regression").reset_index(names="id")

tru.add_data(
    data=sf_error_infs,
    data_split_name="San Francisco",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=list(sf_error_infs.columns.drop("id"))
    ),
    model_output_context=model_output_context
)

tru.add_data(
    data=se_error_infs,
    data_split_name="Seattle",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=list(se_error_infs.columns.drop("id"))
    ),
    model_output_context=model_output_context
)

tru.add_data(
    data=au_error_infs,
    data_split_name="Austin",
    column_spec=ColumnSpec(
        id_col_name="id",
        feature_influence_col_names=list(au_error_infs.columns.drop("id"))
    ),
    model_output_context=model_output_context
)

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpbvnidxr7
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.



INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpbvnidxr7
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.



INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpbvnidxr7
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.



Uploading tmp6ztx7hxr.parquet (33.7KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmplj8_gs7v.parquet (34.4KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmpyjjgb976.parquet (34.0KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


### Get the average ground truth price in each city to use for defining our drift (stability) test thresholds.

In [11]:
tru.set_data_split("San Francisco")
san_francisco_mean_price = tru.get_ys().mean()
tru.set_data_split("Seattle")
seattle_mean_price = tru.get_ys().mean()
tru.set_data_split("Austin")
austin_mean_price = tru.get_ys().mean()

print("San Francisco mean listing price: " + str(san_francisco_mean_price))
print("Seattle mean listing price: " + str(seattle_mean_price))
print("Austin mean listing price: " + str(austin_mean_price))

# calculate expected difference in price recommendations from beach-head to target market
seattle_expected_difference = seattle_mean_price - san_francisco_mean_price
austin_expected_difference = austin_mean_price - san_francisco_mean_price

print("Expected price difference from San Francisco to Seattle: " + str(seattle_expected_difference))
print("Expected price difference from San Francisco to Austin: " + str(austin_expected_difference))

San Francisco mean listing price: __truera_label__    205.25581
dtype: float64
Seattle mean listing price: __truera_label__    127.8074
dtype: float64
Austin mean listing price: __truera_label__    227.011264
dtype: float64
Expected price difference from San Francisco to Seattle: __truera_label__   -77.44841
dtype: float64
Expected price difference from San Francisco to Austin: __truera_label__    21.755454
dtype: float64


### Test for drift (stability) in Seattle and Austin.

In [12]:
# add stability test

# create stability tests in accordance with the setup
tru.tester.add_stability_test(
    test_name="Stability Test - Seattle",
    base_data_split_name="San Francisco",
    comparison_data_split_name_regex="Seattle",
    fail_if_outside=[seattle_expected_difference - 65, seattle_expected_difference + 65])

tru.tester.add_stability_test(
    test_name="Stability Test - Austin",
    base_data_split_name="San Francisco",
    comparison_data_split_names=["Austin"],
    fail_if_outside=[austin_expected_difference - 40, austin_expected_difference + 40])

tru.set_model("model_1")
tru.tester.get_model_test_results(test_types=["stability"])

| | Name | Comparison Split | Base Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|---|
| ❌ | Stability Test - Seattle | Seattle | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | -4.5582 | Explore in UI |
| ❌ | Stability Test - Austin | Austin | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | 64.0755 | Explore in UI |


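The model fails in both Seattle and Austin: each DIFFERENCE_OF_MEAN score falls outside the allowed band around the expected price difference. In Seattle, the score is only about $4.56 below San Francisco, while ground-truth prices are about $77 lower. In Austin, the score is about $64 higher, while ground-truth prices are only about $22 higher.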
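The arithmetic behind these results can be sketched in plain Python using the mean prices and scores printed above. The helper names are hypothetical, and treating the band limits as inclusive is an assumption; the notebook does not say how TruEra handles a score exactly on the boundary.

```python
def tolerance_band(expected_difference, tolerance):
    # Allowed range for the difference of means, centered on the expected gap.
    return (expected_difference - tolerance, expected_difference + tolerance)


def is_stable(score, band):
    # Assumption: limits are inclusive.
    low, high = band
    return low <= score <= high


# Mean listing prices copied from the notebook output above.
sf_mean, seattle_mean, austin_mean = 205.25581, 127.8074, 227.011264

checks = {
    "Seattle": {"score": -4.5582, "expected": seattle_mean - sf_mean, "tolerance": 65},
    "Austin": {"score": 64.0755, "expected": austin_mean - sf_mean, "tolerance": 40},
}

results = {}
for city, check in checks.items():
    band = tolerance_band(check["expected"], check["tolerance"])
    results[city] = is_stable(check["score"], band)
    print(
        f"{city}: band=({band[0]:.2f}, {band[1]:.2f}) "
        f"score={check['score']:.2f} pass={results[city]}"
    )
```

Both scores land above the upper limit of their band, so the recommendations in each new city run higher than the ground-truth price gap would justify.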

### From here, navigate to the TruEra Web App for analysis or continue on to Part 2!     [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/truera-examples/blob/release/rc-1.37/starter-examples/starter-drift-part-2.ipynb)