# Guided Exercise: Drift
This is a continuation of part 1. If you missed it: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/truera-examples/blob/release/rc-1.37/starter-examples/starter-drift-part-1.ipynb)

#### Goals 🎯
In this tutorial, you will learn how to:
1. View the results of stability tests set up in part 1.
2. Debug the true cause of stability issues.
3. Retest the new model and confirm the effectiveness of the mitigation strategy.

### First, set the credentials for your TruEra deployment.
If you don't have credentials yet, get them instantly by signing up for free at: https://www.truera.com

In [1]:
# connection details
TRUERA_URL = "https://app.truera.net"
AUTH_TOKEN = "..."

### Install the required packages for running in colab

In [4]:
! pip install --upgrade truera



### From here, run the rest of the notebook and follow the analysis.

### Next, connect to your TruEra workspace. In part 1, you trained a model in your beachhead market, San Francisco, and added data for Seattle and Austin, your target markets.

In [5]:
import logging
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing
from sklearn.utils import resample

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth, ignore_version_mismatch=True, log_level=logging.ERROR)

# set our environment to remote so we can interact with the project, tests, and results on our remote deployment
tru.set_environment("remote")
# note: we'll periodically toggle to local so we can compute predictions and feature influences on our local machine.

### Test for stability in Seattle and Austin.

In [6]:
# select the project and data collection from part 1
project_name = "Starter Example Companion - Drift"
tru.set_project(project_name)
tru.set_data_collection("Data Collection v1")
tru.get_models()

['model_1']

In [7]:
# add a performance test that compares MAE in Seattle against the San Francisco reference split
tru.tester.add_performance_test(
    test_name="MAE Test",
    all_data_collections=True,
    data_split_name_regex="Seattle",
    metric="MAE",
    reference_split_name="San Francisco",
    fail_if_greater_than=40,
    fail_threshold_type="RELATIVE"
)

In [8]:
tru.tester.get_model_leaderboard(sort_by="performance")

| Model Name | Train Split Name | Train Parameters | Performance Tests (Failed/Warning/Total) | Fairness Tests (Failed/Warning/Total) | Stability Tests (Failed/Warning/Total) | Feature Importance Tests (Failed/Warning/Total) |
| --- | --- | --- | --- | --- | --- | --- |
| model_1 | San Francisco | eta: 0.2 max_depth: 4.0 model_type: xgb.XGBRegressor | 0 ❌ / 0 ⚠️ / 1 | 0 ❌ / 0 ⚠️ / 0 | 2 ❌ / 0 ⚠️ / 2 | 0 ❌ / 0 ⚠️ / 0 |


In [9]:
tru.set_model("model_1")
tru.tester.get_model_test_results(test_types=["stability"])

|  | Name | Comparison Split | Base Split | Segment | Metric | Score | Navigate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | Stability Test - Seattle | Seattle | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | -4.5409 | Explore in UI |
| ❌ | Stability Test - Austin | Austin | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | 64.1611 | Explore in UI |


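The model fails the stability tests in both Seattle and Austin: in each new city, the shift in the mean predicted score relative to San Francisco falls outside the range the test allows, so the scores have drifted too far from the ground truth.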
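To make the pass/fail logic concrete, here is a minimal sketch of the range check behind a stability test. The scores are copied from the table above, and the bounds are the ones this notebook prints later for the same two tests; it is an assumption that those bounds also apply to `model_1`. `fails_stability` is a hypothetical helper, not part of the TruEra client.

```python
def fails_stability(score, lower, upper):
    """Return True when the score lies outside the allowed [lower, upper] range."""
    return score < lower or score > upper


# Scores and bounds copied from the outputs in this notebook.
tests = {
    "Seattle": {"score": -4.5409, "bounds": (-142.44841, -12.44841)},
    "Austin": {"score": 64.1611, "bounds": (-18.244545, 61.755455)},
}

results = {
    city: fails_stability(test["score"], *test["bounds"])
    for city, test in tests.items()
}

for city, failed in results.items():
    print(f"{city}: {'fail' if failed else 'pass'}")
```

Under these assumptions both scores sit above their upper bound, which matches the two ❌ results.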

In [10]:
explainer = tru.get_explainer("Austin", comparison_data_splits=["San Francisco"])
explainer.find_hotspots(max_num_responses=5)

|  | segment_definition | MAE | size | size (%) |
| --- | --- | --- | --- | --- |
| 0 | _DATA_GROUND_TRUTH in_range [829.0, 999.0] | 381.233582 | 185 | 2.023185 |
| 2 | bathrooms in_range [2.5, 7.0] | 174.443787 | 1055 | 11.53762 |
| 1 | accommodates in_range [6.0, 16.0] | 161.751678 | 2619 | 28.641732 |
| 4 | beds in_range [3.0, 16.0] | 155.646561 | 2636 | 28.827647 |
| 3 | bedrooms in_range [2.0, 10.0] | 155.508789 | 3835 | 41.94007 |


As an example, let's focus on the segment relating to bedrooms and compare the MAE of that group to the rest.
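For intuition, a per-segment MAE comparison can be sketched in plain Python. The rows below are made-up toy values rather than the Austin data, and `mean_absolute_error` is a small helper written for this illustration; the notebook itself relies on the TruEra explainer for the real numbers.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy rows (made up for illustration): (bedrooms, actual price, predicted price)
rows = [
    (1, 100.0, 110.0),
    (1, 150.0, 130.0),
    (2, 300.0, 200.0),
    (3, 500.0, 380.0),
]

# The same split used in this notebook: fewer than 2 bedrooms vs. 2 or more.
segments = {
    "Few Bedrooms": lambda bedrooms: bedrooms < 2,
    "More Bedrooms": lambda bedrooms: bedrooms >= 2,
}

segment_mae = {}
for name, keep in segments.items():
    subset = [(actual, pred) for bedrooms, actual, pred in rows if keep(bedrooms)]
    actuals = [actual for actual, _ in subset]
    preds = [pred for _, pred in subset]
    segment_mae[name] = mean_absolute_error(actuals, preds)

print(segment_mae)
```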

In [11]:
explainer = tru.get_explainer(base_data_split="Austin")
tru.set_data_split("Austin")
tru.add_segment_group("Bedrooms", {"Few Bedrooms": "bedrooms < 2", "More Bedrooms": "bedrooms >= 2"})
explainer.set_segment(segment_group_name="Bedrooms", segment_name="Few Bedrooms")
print("Few bedrooms MAE: " + str(explainer.compute_performance(metric_type="MAE")))
explainer.set_segment(segment_group_name="Bedrooms", segment_name="More Bedrooms")
print("More bedrooms MAE: " + str(explainer.compute_performance(metric_type="MAE")))

Few bedrooms MAE: 77.29932403564453
More bedrooms MAE: 155.5087890625


As expected, the MAE for listings with 2+ bedrooms (155.5) is about double that of listings with fewer bedrooms (77.3). Let's resample the San Francisco data we're training on so that its proportion of larger listings is closer to Austin's.
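A note on the arithmetic: the resampling cell in this notebook draws `share * len(san_francisco)` large listings and keeps every small listing, which moves the share of large listings toward Austin's without matching it exactly. If an exact match is wanted, one option is to solve for the number of large rows directly. The sketch below uses only the standard library and toy rows; `large_rows_needed` and `resample_to_share` are hypothetical helpers, and the 0.4194 target is the Austin share of 2+ bedroom listings from the hotspot table above.

```python
import random


def large_rows_needed(n_small, target_share):
    """Number of large rows so that large / (small + large) is close to target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    return round(target_share / (1 - target_share) * n_small)


def resample_to_share(small, large, target_share, seed=1):
    """Keep all small rows and draw large rows with replacement to hit the target share."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    n_large = large_rows_needed(len(small), target_share)
    return small + rng.choices(large, k=n_large)


# Toy data (made up): 600 small listings and 100 large listings.
small = [{"bedrooms": 1}] * 600
large = [{"bedrooms": 3}] * 100

target_share = 0.4194  # share of 2+ bedroom listings in Austin, from the hotspot table
resampled = resample_to_share(small, large, target_share)
achieved_share = sum(row["bedrooms"] >= 2 for row in resampled) / len(resampled)
print(round(achieved_share, 3))
```

Swapping this in would change the training set, so the stability scores shown later in this notebook would no longer apply as printed.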

In [12]:
# load data
san_francisco = pd.read_csv("https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/San_Francisco_for_stability.csv")
seattle = pd.read_csv("https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/Seattle_for_stability.csv")
austin = pd.read_csv("https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/Austin_for_stability.csv")

# make all float
san_francisco = san_francisco.astype(float).fillna(value=0)
seattle = seattle.astype(float).fillna(value=0)
austin = austin.astype(float).fillna(value=0)

# add point ids
sf_ids = [f"point_{i}" for i in range(len(san_francisco))]
san_francisco["id"] = sf_ids
se_ids = [f"point_{i}" for i in range(len(seattle))]
seattle["id"] = se_ids
au_ids = [f"point_{i}" for i in range(len(austin))]
austin["id"] = au_ids

In [13]:
large_listings = san_francisco[san_francisco["bedrooms"] >= 2]
small_listings = san_francisco[san_francisco["bedrooms"] < 2]

austin_large_listings = austin[austin["bedrooms"] >= 2]
num_samples = int(round((len(austin_large_listings)/len(austin)) * len(san_francisco), 0))

large_listings_resampled = resample(
    large_listings, 
    replace=True,
    n_samples=num_samples,
    random_state=1 # include random seed so we can perform same sampling on each data set
)

san_francisco_resampled = pd.concat([small_listings, large_listings_resampled])

# train new model on resampled sf data
xgb_reg = xgb.XGBRegressor(eta=0.2, max_depth=4)
xgb_reg.fit(san_francisco_resampled.drop(["id", "price"], axis=1), san_francisco_resampled.price)

# download project
tru.set_environment("local")
tru.download_project(project_name)
tru.set_project(project_name)
tru.set_data_collection("Data Collection v1")

# register the model
tru.add_python_model(
    "model_2",
    xgb_reg,
    train_parameters={"model_type": "xgb.XGBRegressor", "eta": 0.2, "max_depth": 4}
)

# sync with remote
tru.upload_project()




In [14]:
# check stability results
tru.set_environment("remote")
tru.set_model("model_2")
tru.tester.get_model_test_results(test_types=["stability"])

|  | Name | Comparison Split | Base Split | Segment | Metric | Score | Navigate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ❌ | Stability Test - Seattle | Seattle | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | -0.8494 | Explore in UI |
| ✅ | Stability Test - Austin | Austin | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | 58.5707 | Explore in UI |


model_2 now passes in Austin and is ready for production there, but it still fails in Seattle. Let's continue to iterate on Seattle.

Since the model errs toward scores that are too high in Seattle, we should look for the largest positive contributors to score drift.

In [15]:
explainer = tru.get_explainer("San Francisco", comparison_data_splits=["Seattle"])
explainer.compute_feature_contributors_to_instability(use_difference_of_means=True).T

| Feature | Seattle |
| --- | --- |
| availability_90 | 0.183109 |
| bathrooms | -0.024676 |
| accommodates | 0.013577 |
| bedrooms | -0.017046 |
| reviews_per_month | -0.07071 |
| beds | 0.003723 |
| minimum_nights | 0.032455 |
| extra_people | -0.042551 |
| availability_365 | 0.048124 |
| features_Host_Is_Superhost | 0.003985 |


`availability_90` is by far the largest positive contributor to score drift in Seattle. Let's remove that feature, along with the related feature `availability_365`, to mitigate this issue.
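The output above is not sorted, so ranking it makes the choice easier to see. A minimal sketch, with the values copied from that output into a plain dictionary (the variable names are illustrative):

```python
# Contributions to score drift for Seattle, copied from the output above.
contributions = {
    "availability_90": 0.183109,
    "bathrooms": -0.024676,
    "accommodates": 0.013577,
    "bedrooms": -0.017046,
    "reviews_per_month": -0.07071,
    "beds": 0.003723,
    "minimum_nights": 0.032455,
    "extra_people": -0.042551,
    "availability_365": 0.048124,
    "features_Host_Is_Superhost": 0.003985,
}

# Sort from the largest positive contribution to the most negative one.
ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)

for feature, value in ranked:
    print(f"{feature:28s} {value:+.6f}")
```

Sorted this way, `availability_90` comes first and `availability_365` second, which is consistent with dropping both.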

In [16]:
tru.set_environment("local")

# train a third model on the resampled data, without the availability features
xgb_reg = xgb.XGBRegressor(eta=0.2, max_depth=4)
xgb_reg.fit(san_francisco_resampled.drop(["id", "price", "availability_90", "availability_365"], axis=1), san_francisco_resampled.price)

# create a second data collection, since the feature set has changed
tru.add_data_collection("Data Collection v2")

# add data splits to the collection we just created
tru.add_data_split("San Francisco", pre_data=san_francisco.drop(["availability_90", "availability_365"], axis=1), label_col_name="price", id_col_name="id", split_type="train")
tru.add_data_split("Seattle", pre_data=seattle.drop(["availability_90", "availability_365"], axis=1), label_col_name="price", id_col_name="id", split_type="test")
tru.add_data_split("Austin", pre_data=austin.drop(["availability_90", "availability_365"], axis=1), label_col_name="price", id_col_name="id", split_type="oot")

# register the model
tru.add_python_model(
    "model_3",
    xgb_reg,
    train_parameters={"model_type": "xgb.XGBRegressor", "eta": 0.2, "max_depth": 4}
)

# sync with remote
tru.upload_project()




In [17]:
# get the test details from model_2 so we can copy them for model_3
tru.set_environment("remote")
tru.set_model("model_2")
tru.tester.get_model_tests().as_dict()["Stability Tests"]["Rows"]

[['Stability Test - Seattle',
  'Seattle',
  'San Francisco',
  'ALL POINTS',
  'DIFFERENCE_OF_MEAN',
  '',
  'Not specified',
  'DIFFERENCE_OF_MEAN < -142.44841 OR DIFFERENCE_OF_MEAN > -12.44841'],
 ['Stability Test - Austin',
  'Austin',
  'San Francisco',
  'ALL POINTS',
  'DIFFERENCE_OF_MEAN',
  '',
  'Not specified',
  'DIFFERENCE_OF_MEAN < -18.244545 OR DIFFERENCE_OF_MEAN > 61.755455']]
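The last element of each row above holds the fail condition as text. Here is a minimal sketch of turning that text into the two-element list passed to `fail_if_outside`, assuming the condition always has the `METRIC < lower OR METRIC > upper` shape shown in this output. `parse_fail_bounds` is a hypothetical helper, not part of the TruEra client.

```python
import re

_NUMBER = r"[-+]?\d+(?:\.\d+)?"
_CONDITION = re.compile(rf"^\w+ < ({_NUMBER}) OR \w+ > ({_NUMBER})$")


def parse_fail_bounds(condition):
    """Extract [lower, upper] from a condition such as 'M < -1.5 OR M > 2.0'."""
    match = _CONDITION.match(condition.strip())
    if match is None:
        raise ValueError(f"Unrecognized condition format: {condition!r}")
    lower, upper = (float(group) for group in match.groups())
    return [lower, upper]


# Condition string copied from the Seattle row in the output above.
seattle_condition = "DIFFERENCE_OF_MEAN < -142.44841 OR DIFFERENCE_OF_MEAN > -12.44841"
bounds = parse_fail_bounds(seattle_condition)
print(bounds)
```

Parsing the bounds avoids retyping them by hand when the same thresholds are reused for another model.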

In [18]:
# we are still on remote from the previous cell, so we can interact with the tester
tru.set_model("model_3")

# add a Seattle stability test for model_3, reusing the fail thresholds from the existing Seattle test
tru.tester.add_stability_test(test_name="Stability Test - Seattle - v3",
    base_data_split_name="San Francisco",
    comparison_data_split_names=["Seattle"],
    fail_if_outside=[-142.44841, -12.44841]
)

# check stability results
tru.tester.get_model_test_results(test_types=["stability"])

|  | Name | Comparison Split | Base Split | Segment | Metric | Score | Navigate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✅ | Stability Test - Seattle - v3 | Seattle | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | -15.448 | Explore in UI |


With model_3, the Seattle stability test now passes. We can deploy model_2 in Austin and model_3 in Seattle as we launch, and our startup's investors are satisfied with these results!