# Guided Exercise: Drift

This is a continuation of part 1. If you missed it: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/15an365tkQZt2g_12O2VeWMSf3mVevnM7)

#### Goals 🎯

In this tutorial, you will learn how to:
1. View the results of stability tests set up in part 1.
2. Debug the true cause of stability issues.
3. Retest the new model and confirm the effectiveness of the mitigation strategy.

### First, set the credentials for your TruEra deployment.
If you don't have credentials yet, get them instantly by signing up for the free open beta: https://app.truera.net

In [None]:
#connection details
TRUERA_URL = "https://app.truera.net"
AUTH_TOKEN = ""

### Install the required packages for running in Colab

In [None]:
! pip install --upgrade Flask
! pip install --upgrade protobuf
! pip install --upgrade shap
! pip install --upgrade truera

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### From here, run the rest of the notebook and follow the analysis.

### First, load data and train the model in your beachhead market, San Francisco. Also add additional data for Seattle and Austin, your target markets.

In [None]:
import logging

import pandas as pd
import sklearn.metrics
import xgboost as xgb
from sklearn import preprocessing
from sklearn.utils import resample

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth, ignore_version_mismatch=True, log_level=logging.ERROR)

# set our environment to remote so we can work with the project, data splits, and tests created in part 1
tru.set_environment("remote")
# note: we'll periodically toggle between local and remote so we can compute locally and still interact with our remote deployment.

### Test for stability in Seattle and Austin.

In [None]:
# select the project and data collection created in part 1
project_name = "Starter Example Companion - Drift"
tru.set_project(project_name)
tru.set_data_collection("Data Collection v1")
tru.get_models()

['xgboost_v1', 'xgboost_v2', 'xgboost_v3']

In [None]:
# add a performance test that compares MAE in Seattle against the San Francisco reference split
tru.tester.add_performance_test(
    test_name="MAE Test",
    all_data_collections=True,
    data_split_name_regex="Seattle",
    metric="MAE",
    reference_split_name="San Francisco",
    fail_if_greater_than=40,
    fail_threshold_type="RELATIVE")

In [None]:
tru.tester.get_model_leaderboard(sort_by='performance')

| Model Name | Train Split Name | Train Parameters | Performance Tests (Failed/Warning/Total) | Fairness Tests (Failed/Warning/Total) | Stability Tests (Failed/Warning/Total) | Feature Importance Tests (Failed/Warning/Total) |
|---|---|---|---|---|---|---|
| xgboost_v1 | San Francisco | eta: 0.2 max_depth: 4.0 model_type: xgb.XGBRegressor | 0 ❌ / 0 ⚠️ / 1 | 0 ❌ / 0 ⚠️ / 0 | 2 ❌ / 0 ⚠️ / 2 | 0 ❌ / 0 ⚠️ / 0 |
| xgboost_v2 | San Francisco - resampled | eta: 0.2 max_depth: 4.0 model_type: xgb.XGBRegressor | 0 ❌ / 0 ⚠️ / 1 | 0 ❌ / 0 ⚠️ / 0 | 1 ❌ / 0 ⚠️ / 2 | 0 ❌ / 0 ⚠️ / 0 |


In [None]:
tru.set_model('model_1')
tru.tester.get_model_test_results(test_types=["stability"])

|  | Name | Comparison Split | Base Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|---|
| ❌ | Stability Test - Seattle | Seattle | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | -4.5409 | Explore in UI |
| ❌ | Stability Test - Austin | Austin | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | 64.1611 | Explore in UI |


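The model fails in both Seattle and Austin: in each city, the difference in mean score relative to San Francisco falls outside the range the stability test allows, meaning the scores drifted too far from the ground truth in the new cities.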
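To make the `DIFFERENCE_OF_MEAN` metric concrete, here is a minimal sketch of how such a score could be computed from two sets of model scores. The function name, the sign convention (comparison minus base), and the toy values are assumptions for illustration; this is not TruEra's implementation.

```python
import numpy as np

def difference_of_means(base_scores, comparison_scores):
    """Mean comparison-split score minus mean base-split score.

    The sign convention is an assumption made for this sketch.
    """
    return float(np.mean(comparison_scores) - np.mean(base_scores))

# Made-up nightly price predictions, for illustration only
base_scores = np.array([100.0, 150.0, 200.0])        # stand-in for the base split
comparison_scores = np.array([160.0, 210.0, 260.0])  # stand-in for a new city

drift = difference_of_means(base_scores, comparison_scores)
print(f"difference of means: {drift:.1f}")
```

A positive value here means the model scores the comparison split higher on average than the base split.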

In [None]:
explainer = tru.get_explainer('Austin', comparison_data_splits=['San Francisco'])
explainer.suggest_high_error_segments()

|  | representation | MAE | size | size (%) |
|---|---|---|---|---|
| 0 | accommodates <= 16.0 AND accommodates >= 6.0 | 161.751678 | 2619 | 28.641732 |
| 1 | bathrooms <= 7.0 AND bathrooms >= 2.5 | 174.443787 | 1055 | 11.53762 |
| 2 | bedrooms <= 10.0 AND bedrooms >= 2.0 | 155.508789 | 3835 | 41.94007 |


Notice that the top three error segments all relate to listing size. Let's focus on the largest group, bedrooms, and compare the MAE of that group to the rest.

In [None]:
explainer = tru.get_explainer(base_data_split='Austin')
tru.set_data_split("Austin")
# uncomment on the first run if the "Bedrooms" segment group does not exist yet
#tru.add_segment_group("Bedrooms", {"Few Bedrooms": "bedrooms < 2", "More Bedrooms": "bedrooms >= 2"})
explainer.set_segment(segment_group_name = "Bedrooms", segment_name = "Few Bedrooms")
print("Few bedrooms mae: " + str(explainer.compute_performance(metric_type="MAE")))
explainer.set_segment(segment_group_name = "Bedrooms", segment_name = "More Bedrooms")
print("More bedrooms mae: " + str(explainer.compute_performance(metric_type="MAE")))

Few bedrooms mae: 77.29932403564453
More bedrooms mae: 155.5087890625
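The segment comparison above amounts to computing the mean absolute error separately inside and outside a filter. The sketch below shows that idea in plain pandas on toy data; the `prediction` column, the helper name `mae_by_segment`, and all values are hypothetical and are not taken from the notebook's data.

```python
import pandas as pd

def mae_by_segment(df, mask, label_col="price", pred_col="prediction"):
    """Return the MAE for rows where mask is True and for the remaining rows."""
    abs_err = (df[label_col] - df[pred_col]).abs()
    return {
        "in_segment": float(abs_err[mask].mean()),
        "rest": float(abs_err[~mask].mean()),
    }

# Toy listings with made-up prices and predictions
toy = pd.DataFrame({
    "bedrooms":   [1, 1, 2, 3],
    "price":      [100.0, 120.0, 300.0, 400.0],
    "prediction": [110.0, 100.0, 200.0, 250.0],
})

result = mae_by_segment(toy, toy["bedrooms"] >= 2)
print(result)
```

Comparing the two numbers side by side is what reveals whether a segment carries a disproportionate share of the error.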


As expected, the MAE for listings with 2+ bedrooms (about 155.5) is roughly double that of listings with fewer bedrooms (about 77.3). Let's resample the San Francisco training data to oversample larger listings, using Austin's share of larger listings to set the sample size.

In [None]:
# load data
san_francisco = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/San_Francisco_for_stability.csv')
seattle = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/Seattle_for_stability.csv')
austin = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-stability/Austin_for_stability.csv')

#make all float
san_francisco = san_francisco.astype(float).fillna(value=0)
seattle = seattle.astype(float).fillna(value=0)
austin = austin.astype(float).fillna(value=0)

#add point ids
sf_ids = [f"point_{i}" for i in range(len(san_francisco))]
san_francisco["id"] = sf_ids

se_ids = [f"point_{i}" for i in range(len(seattle))]
seattle["id"] = se_ids

a_ids = [f"point_{i}" for i in range(len(austin))]
austin["id"] = a_ids

In [None]:
large_listings = san_francisco[san_francisco['bedrooms'] >= 2]
small_listings = san_francisco[san_francisco['bedrooms'] < 2]

austin_large_listings = austin[austin['bedrooms'] >= 2]
num_samples = int(round((len(austin_large_listings)/len(austin)) * len(san_francisco), 0))

large_listings_resampled = resample(
        large_listings, 
        replace=True,
        n_samples=num_samples,
        random_state=1 # include random seed so we can perform same sampling on each data set
        )

san_francisco_resampled = pd.concat([small_listings, large_listings_resampled])

# train new model on resampled sf data
xgb_reg = xgb.XGBRegressor(eta = 0.2, max_depth = 4)
xgb_reg.fit(san_francisco_resampled.drop(['id','price'], axis = 1), san_francisco_resampled.price)

# add resampled data split
tru.add_data_split("San Francisco - resampled",
                   pre_data = san_francisco_resampled,
                   label_col_name = "price",
                   id_col_name = "id",
                   split_type = "train")

# register the model
tru.add_python_model("model_2", xgb_reg, train_split_name="San Francisco - resampled",
                     train_parameters = {"model_type":"xgb.XGBRegressor", "eta":0.2, "max_depth":4})

# sync with remote
tru.upload_project()
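
It can help to check what share of larger listings the resampled training set actually ends up with. The sketch below repeats the same recipe on toy data, using `DataFrame.sample` as a stand-in for `sklearn.utils.resample`, and compares it with a hypothetical variant that also resamples the small listings. The variant is not part of this notebook, and all values are made up.

```python
import pandas as pd

# Toy stand-ins for the real splits (values are made up)
sf = pd.DataFrame({"bedrooms": [1] * 8 + [2] * 2})      # 20% large listings
target = pd.DataFrame({"bedrooms": [1] * 6 + [3] * 4})  # 40% large listings

target_frac = (target["bedrooms"] >= 2).mean()
n_large = int(round(target_frac * len(sf)))

small = sf[sf["bedrooms"] < 2]
large = sf[sf["bedrooms"] >= 2]

# Same recipe as the cell above: keep every small listing, oversample the large ones
large_up = large.sample(n=n_large, replace=True, random_state=1)
notebook_style = pd.concat([small, large_up])
notebook_frac = (notebook_style["bedrooms"] >= 2).mean()

# Hypothetical variant: also resample the small listings so the total stays len(sf)
small_rs = small.sample(n=len(sf) - n_large, replace=True, random_state=1)
matched = pd.concat([small_rs, large_up])
matched_frac = (matched["bedrooms"] >= 2).mean()

print(f"target share:   {target_frac:.3f}")
print(f"notebook style: {notebook_frac:.3f}")
print(f"matched total:  {matched_frac:.3f}")
```

On this toy data, keeping every small listing moves the share of large listings toward the target without reaching it, while the variant matches it exactly. Whether the difference matters depends on the data.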


In [None]:
# check stability results
tru.set_environment("remote")
tru.set_model("model_2")
    
tru.tester.get_model_test_results(test_types=["stability"])

|  | Name | Comparison Split | Base Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|---|
| ❌ | Stability Test - Seattle | Seattle | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | -1.0388 | Explore in UI |
| ✅ | Stability Test - Austin | Austin | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | 57.4268 | Explore in UI |


The model now passes in Austin and is ready for production, while it still fails in Seattle. Let's continue to iterate on Seattle.

Because the model's Seattle scores are too high (the difference of means sits above the range the test allows), we should look for the largest positive contributors to score drift.

In [None]:
explainer = tru.get_explainer("San Francisco", comparison_data_splits=["Seattle"])
explainer.compute_feature_contributors_to_instability(use_difference_of_means=True)

|  | Seattle |
|---|---|
| availability_90 | 0.181856 |
| room_type_Entire_home/apt | 0.081749 |
| accommodates | 0.065924 |
| minimum_nights | 0.047815 |
| amenities_Smoke_detector | 0.047023 |


`availability_90` is by far the largest positive contributor to score drift in Seattle. Let's remove that feature, along with the related feature `availability_365`, to mitigate this issue.
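
One way to build intuition for per-feature contributions to a difference of means: with additive feature attributions, the gap in mean score between two splits splits exactly into per-feature gaps in mean attribution. The sketch below demonstrates this with a hypothetical linear model and synthetic data. It illustrates the general idea only and is not a description of how `compute_feature_contributors_to_instability` works internally.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([2.0, -1.0, 0.5])  # hypothetical linear model: score = x @ weights
feature_names = ["f0", "f1", "f2"]    # placeholder feature names

base = rng.normal(0.0, 1.0, size=(500, 3))        # stand-in for the base split
comparison = rng.normal(0.0, 1.0, size=(500, 3))  # stand-in for the comparison split
comparison[:, 0] += 1.5                           # only f0 shifts between splits

reference_mean = base.mean(axis=0)

def influences(x):
    # Exact additive attributions for a linear model, relative to the base-split mean
    return weights * (x - reference_mean)

# Per-feature contribution: gap in mean attribution between the two splits
contrib = influences(comparison).mean(axis=0) - influences(base).mean(axis=0)

# Overall drift: gap in mean score between the two splits
score_drift = (comparison @ weights).mean() - (base @ weights).mean()

for name, value in zip(feature_names, contrib):
    print(f"{name}: {value:+.3f}")
print(f"sum of contributions: {contrib.sum():+.3f}, score drift: {score_drift:+.3f}")
```

In this toy setup, the shifted feature dominates the contributions, and the contributions sum to the overall drift.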

In [None]:
tru.set_environment("local")

# train the third model without the availability features
xgb_reg = xgb.XGBRegressor(eta = 0.2, max_depth = 4)
xgb_reg.fit(san_francisco_resampled.drop(['id', 'price', 'availability_90', 'availability_365'], axis = 1), san_francisco_resampled.price)

# create a second data collection for the reduced feature set
tru.add_data_collection("Data Collection v2")

# add data splits to the collection we just created
tru.add_data_split("San Francisco", pre_data = san_francisco.drop(['availability_90', 'availability_365'], axis = 1), label_col_name = "price", id_col_name = "id", split_type = "train")
tru.add_data_split("San Francisco - resampled", pre_data = san_francisco_resampled.drop(['availability_90', 'availability_365'], axis = 1), label_col_name = "price", id_col_name = "id", split_type = "train")
tru.add_data_split("Seattle", pre_data = seattle.drop(['availability_90', 'availability_365'], axis = 1), label_col_name = "price", id_col_name = "id", split_type = "test")
tru.add_data_split("Austin", pre_data = austin.drop(['availability_90', 'availability_365'], axis = 1), label_col_name = "price", id_col_name = "id", split_type = "oot")

# register the model
tru.add_python_model("model_3", xgb_reg, train_split_name="San Francisco - resampled", train_parameters = {"model_type":"xgb.XGBRegressor", "eta":0.2, "max_depth":4})

# sync with remote
tru.upload_project()

                                   

In [None]:
#get the test details from model_2 so we can copy them for model_3
tru.set_environment("remote")
tru.set_model("model_2")
tru.tester.get_model_tests().as_dict()['Stability Tests']['Rows']

[['Stability Test - Seattle',
  'Seattle',
  'San Francisco',
  'ALL POINTS',
  'DIFFERENCE_OF_MEAN',
  '',
  'Not specified',
  'DIFFERENCE_OF_MEAN < -142.44841 OR DIFFERENCE_OF_MEAN > -12.44841'],
 ['Stability Test - Austin',
  'Austin',
  'San Francisco',
  'ALL POINTS',
  'DIFFERENCE_OF_MEAN',
  '',
  'Not specified',
  'DIFFERENCE_OF_MEAN < -18.244545 OR DIFFERENCE_OF_MEAN > 61.755455']]

In [None]:
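# the environment is still set to remote from the previous cell, so we can interact with the tester

# replace model_3's stability tests and check the results
tru.set_model("model_3")
tru.tester.delete_tests(test_type="stability")
# reuse the Seattle fail thresholds copied from model_2's test above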
tru.tester.add_stability_test(test_name = "Stability Test - Seattle - v3",
    base_data_split_name = "San Francisco",
    comparison_data_split_names = ["Seattle"],
    fail_if_outside = [-142.44841, -12.44841])

tru.tester.get_model_test_results(test_types=["stability"])

|  | Name | Comparison Split | Base Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|---|
| ✅ | Stability Test - Seattle - v3 | Seattle | San Francisco | ALL POINTS | DIFFERENCE_OF_MEAN | -12.9788 | Explore in UI |
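
As a sanity check, the pass/fail outcomes reported in this notebook can be reproduced by comparing each score with the fail ranges listed in the test definitions. The helper `passes_stability` is a hypothetical name used only for this sketch. The scores and thresholds are copied from the outputs above.

```python
def passes_stability(score, allowed_range):
    """A test fails when the score is below the lower or above the upper bound."""
    low, high = allowed_range
    return low <= score <= high

# Fail thresholds as listed in the stability test definitions
seattle_range = (-142.44841, -12.44841)
austin_range = (-18.244545, 61.755455)

# DIFFERENCE_OF_MEAN scores reported for each model
seattle_scores = {"model_1": -4.5409, "model_2": -1.0388, "model_3": -12.9788}
austin_scores = {"model_1": 64.1611, "model_2": 57.4268}

seattle_results = {m: passes_stability(s, seattle_range) for m, s in seattle_scores.items()}
austin_results = {m: passes_stability(s, austin_range) for m, s in austin_scores.items()}

print("Seattle:", seattle_results)
print("Austin: ", austin_results)
```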


In v3, the model now passes in Seattle. We can deploy the v2 model in Austin and the v3 model in Seattle at launch, and our startup's investors are satisfied with these results!