# Guided Exercise: Drift

### Setup:
You are the principal data scientist at a new startup that recommends prices for rental home listings. Your beach-head market was San Francisco, and that is where you trained the model at the core of the business. Now the startup is looking to expand into Seattle and Austin. Using the mean price difference between San Francisco and each new city, you want to make sure your price recommendations don't drift. If they drift too low, your customers will leave money on the table; if they drift too high, their listings will sit vacant. Hitting the Goldilocks zone is critical for acquiring and keeping happy customers in the new markets.

Competitors in Seattle are within $65 of the ideal price and, due to stiffer competition, competitors in Austin are within $40 of the ideal price. These are the benchmarks we need to hit to prove a viable product.

#### Goals 🎯

In this tutorial, you will learn how to:
1. Set up and view the results of stability tests.
2. Debug the true cause of stability issues.
3. Retest the new model and confirm the effectiveness of the mitigation strategy.

### First, set the credentials for your TruEra deployment.

If you don't have credentials yet, get them instantly by signing up for the free open beta: https://app.truera.net

In [1]:
#connection details
TRUERA_URL = "https://app.truera.net"
AUTH_TOKEN = ""

### Install required packages for running in colab

In [None]:
! pip install --upgrade shap
! pip install --upgrade truera

### From here, run the rest of the notebook and follow the analysis.

### Next, load the data and train the model in your beach-head market, San Francisco. Also add data for Seattle and Austin, your target markets.

In [3]:
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing
import sklearn.metrics
from sklearn.utils import resample
import logging

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

# set our environment to local compute so we can compute predictions and feature influences on our local machine
tru.set_environment("local")
# note: we'll periodically toggle between local and remote so we can interact with our remote deployment as well.



In [4]:
# load data
san_francisco = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/San_Francisco_for_stability.csv')
seattle = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/Seattle_for_stability.csv')
austin = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/Austin_for_stability.csv')

#make all float
san_francisco = san_francisco.astype(float)
seattle = seattle.astype(float)
austin = austin.astype(float)

#add point ids
sf_ids = [f"point_{i}" for i in range(len(san_francisco))]
san_francisco["id"] = sf_ids

se_ids = [f"point_{i}" for i in range(len(seattle))]
seattle["id"] = se_ids

a_ids = [f"point_{i}" for i in range(len(austin))]
austin["id"] = a_ids

# train first model
xgb_reg = xgb.XGBRegressor(eta = 0.2, max_depth = 4)
xgb_reg.fit(san_francisco.drop(['price', 'id'], axis = 1), san_francisco.price)

# create the first project and data collection
project_name = "Starter Example Companion - Drift"
tru.add_project(project_name, score_type = 'regression')
tru.add_data_collection("Data Collection v1")

# add data splits to the collection we just created
tru.add_data_split("San Francisco", pre_data = san_francisco, label_col_name = 'price', id_col_name = 'id', split_type = "train")
tru.add_data_split("Seattle", pre_data = seattle, label_col_name = 'price', id_col_name = 'id', split_type = "test")
tru.add_data_split("Austin", pre_data = austin, label_col_name = 'price', id_col_name = 'id', split_type = "oot")

# register the model
tru.add_python_model("model_1", xgb_reg, train_split_name="San Francisco", train_parameters = {"model_type":"xgb.XGBRegressor", "eta":0.2, "max_depth":4})

# sync with remote
tru.upload_project()

INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". 
INFO:truera.client.local.local_truera_workspace:Data split "San Francisco" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.local.local_truera_workspace:Data split "Seattle" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.local.local_truera_workspace:Data split "Austin" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.local.local_truera_workspace:Model "model_1" is added and associated with local data collection "Data Collection v1". "model_1" is set as the model for the workspace context.
INFO:truera.client.truera_workspace:Uploading data collection Data Collection v1.
INFO:truera.client.truera_workspace:Uploading data split San Fr

Uploading tmp26x7x92g.parquet -- ### -- file upload complete.
Put resource done.
Uploading tmphednotq6.parquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Call to join: rowsets ['a97b58ea-8475-4667-b4b8-5e3b995bb62a', '81b3673e-a96a-407b-a9f5-b1e2c92c062f'] on ['id'] with default inner join.
INFO:truera.client.truera_workspace:Uploading data split Seattle.


Uploading tmp89drxfoe.parquet -- ### -- file upload complete.
Put resource done.
Uploading tmph4gvoc86.parquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Call to join: rowsets ['5b1c7d06-f281-4c15-9f95-260007d1f024', '357b2b4e-c42d-4389-ae36-2c36e2ea3d14'] on ['id'] with default inner join.
INFO:truera.client.truera_workspace:Uploading data split Austin.


Uploading tmpxye6_jfy.parquet -- ### -- file upload complete.
Put resource done.
Uploading tmpop2xi693.parquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Call to join: rowsets ['a768cf10-4d31-459d-bf33-9bb78d567f44', '447e148a-487c-43a3-a10f-b5da24f7a0d3'] on ['id'] with default inner join.
INFO:truera.client.truera_workspace:Uploading model model_1.
INFO:truera.client.remote_truera_workspace:Uploading xgboost model: XGBRegressor


Verification Done
Uploading MLmodel -- ### -- file upload complete.
Uploading tmp_2zxumr0.json -- ### -- file upload complete.
Uploading conda.yaml -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.py -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "model_1" is added and associated with remote data collection "Data Collection v1". "model_1" is set as the model for the workspace context.


Model uploaded to: https://daily-demo-truera1.sandbox.truera.com/home/p/Starter%20Example%20Companion%20-%20Drift/m/model_1/
Uploading tmp7ude3sozparquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmprwlk2j3kparquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmpre6raejrparquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Influence algorithm for local project is: truera-qii
INFO:truera.client.truera_workspace:Influence algorithm for local project is: truera-qii
INFO:truera.client.truera_workspace:Influence algorithm for local project is: truera-qii


### Get the average ground truth price in each city to use for defining our stability test thresholds.

In [5]:
tru.set_data_split("San Francisco")
San_Francisco_mean_price = tru.get_ys().mean()
tru.set_data_split("Seattle")
Seattle_mean_price = tru.get_ys().mean()
tru.set_data_split("Austin")
Austin_mean_price = tru.get_ys().mean()

print("San Francisco mean listing price: " + str(San_Francisco_mean_price))
print("Seattle mean listing price: " + str(Seattle_mean_price))
print("Austin mean listing price: " + str(Austin_mean_price))

#calculate expected difference in price recommendations from beach-head to target market
Seattle_expected_difference = Seattle_mean_price - San_Francisco_mean_price
Austin_expected_difference = Austin_mean_price - San_Francisco_mean_price

print("Expected price difference from San Francisco to Seattle: " + str(Seattle_expected_difference))
print("Expected price difference from San Francisco to Austin: " + str(Austin_expected_difference))


San Francisco mean listing price: 205.2558100370495
Seattle mean listing price: 127.80739963264234
Austin mean listing price: 227.01126421697288
Expected price difference from San Francisco to Seattle: -77.44841040440717
Expected price difference from San Francisco to Austin: 21.755454179923362
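
The stability tests in the next section place a tolerance band around each expected difference: plus or minus $65 for Seattle and plus or minus $40 for Austin, per the competitor benchmarks in the setup. A minimal sketch of that arithmetic, using the values printed above, is shown here. `tolerance_band` is a hypothetical helper written for illustration and is not part of the TruEra SDK.

```python
def tolerance_band(expected_difference, tolerance):
    """Return (lower, upper) bounds centred on the expected difference."""
    return expected_difference - tolerance, expected_difference + tolerance


# Values copied from the printed output above.
seattle_expected_difference = -77.44841040440717
austin_expected_difference = 21.755454179923362

# Tolerances come from the competitor benchmarks in the setup.
seattle_band = tolerance_band(seattle_expected_difference, 65)
austin_band = tolerance_band(austin_expected_difference, 40)

print("Seattle band:", [round(b, 2) for b in seattle_band])  # [-142.45, -12.45]
print("Austin band:", [round(b, 2) for b in austin_band])    # [-18.24, 61.76]
```

These are the same bounds that the `fail_if_outside` arguments receive in the next cell.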


### Test for stability in Seattle and Austin.

In [7]:
# toggle back to remote to interact with the tester
tru.set_environment("remote")
tru.set_project(project_name)
tru.set_data_collection("Data Collection v1")

# Create stability tests in accordance with the setup
tru.tester.add_stability_test(test_name = "Stability Test - Seattle",
    base_data_split_name = "San Francisco",
    comparison_data_split_name_regex = "Seattle",
    fail_if_outside = [Seattle_expected_difference - 65, Seattle_expected_difference + 65])

tru.tester.add_stability_test(test_name = "Stability Test - Austin",
    base_data_split_name = "San Francisco",
    comparison_data_split_names = ["Austin"],
    fail_if_outside = [Austin_expected_difference - 40, Austin_expected_difference + 40])

tru.set_model("model_1")
tru.tester.get_model_test_results(test_types=["stability"])

INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v1". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_1".


| | Name | Comparison Split | Base Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|---|
| ❌ | Stability Test - Seattle | Seattle | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | -4.5409 | Explore in UI |
| ❌ | Stability Test - Austin | Austin | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | 64.1611 | Explore in UI |


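The model fails both tests: the difference in mean predictions between San Francisco and each new city falls outside the allowed band around the expected (ground-truth) price difference. In Seattle, the score of -4.54 is above the upper bound of about -12.45; in Austin, the score of 64.16 is above the upper bound of about 61.76.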
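
To make the `DIFFERENCE_OF_MEAN` score more concrete, here is a rough sketch of the calculation on toy numbers. It assumes the metric is the mean prediction on the comparison split minus the mean prediction on the base split, which matches how the expected differences were defined earlier, but the exact definition used by TruEra is not confirmed here. The function names and the prediction values are invented for illustration.

```python
from statistics import mean


def difference_of_means(comparison_predictions, base_predictions):
    """Assumed metric: mean(comparison) - mean(base)."""
    return mean(comparison_predictions) - mean(base_predictions)


def outside_band(score, lower, upper):
    """True if the score falls outside [lower, upper], i.e. the test fails."""
    return score < lower or score > upper


# Toy predictions, invented purely for illustration.
base_predictions = [200.0, 210.0, 190.0]        # mean 200
comparison_predictions = [150.0, 170.0, 160.0]  # mean 160

score = difference_of_means(comparison_predictions, base_predictions)
print(score)                                          # -40.0
print(outside_band(score, lower=-60.0, upper=-20.0))  # False: test passes
print(outside_band(score, lower=-30.0, upper=-10.0))  # True: test fails
```

With the same score, only the placement of the band decides whether the test passes, which is why the thresholds are anchored to the ground-truth price difference for each city.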

### From here, navigate to the TruEra Web App for analysis or continue on to Part 2!

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SIshdf_nE2dCWPdGNfUJ3UUuWgbocANn)