# Guided Exercise: Performance Part 2
#### Goals 🎯
In this tutorial, you will use TruEra to make performance improvements to the model created in part 1 in a structured and methodical way!

If you missed part one and need to go back: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/truera-examples/blob/release/rc-1.37/starter-examples/starter-performance-part-1.ipynb)

In this tutorial, you will:
1. View the results of the performance and feature importance tests created in part 1.
2. Find actionable issues with the model created in part 1.
3. Mitigate these issues and re-upload your model to TruEra.
4. Retest the new model and confirm the effectiveness of the mitigation strategy.

### First, set the credentials for your TruEra deployment.

If you don't have credentials yet, get them instantly by signing up for free at: https://www.truera.com

In [None]:
#connection details
TRUERA_URL = "https://app.truera.net"
AUTH_TOKEN = ""

### Install the required packages for running in colab

In [None]:
! pip install --upgrade truera

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### From here, run the rest of the notebook and follow the analysis.

In [None]:
import pandas as pd
import xgboost as xgb
import logging

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

# set our environment to remote to view the test results from part one
tru.set_environment("remote")
# note: we'll periodically toggle between local and remote so we can interact with our remote deployment as well.

### First, let's review the test results from part 1.

In [None]:

# set project and data collection
project_name = "Starter Example Companion - Performance"
tru.set_project(project_name)
tru.set_data_collection("Data Collection v1")

# get model results
tru.set_model("model_1")
tru.tester.get_model_test_results(test_types=["performance"])

INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v1". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_1".


| | Name | Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|
| ❌ | Relative MAE Test | Seattle | ALL POINTS | MAE | 110.2897 | Explore in UI |
| ❌ | RMSE Test | Seattle | ALL POINTS | RMSE | 154.0362 | Explore in UI |
| ✅ | RMSE Test | San Francisco | ALL POINTS | RMSE | 75.6326 | Explore in UI |
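
The tests above report MAE and RMSE. As a reminder of how the two metrics differ, here is a minimal sketch on made-up prices (the numbers are illustrative and are not taken from the San Francisco or Seattle data). RMSE squares each error before averaging, so a single large miss raises it much more than it raises MAE.

```python
import numpy as np

# Made-up true and predicted prices, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 500.0])

errors = y_pred - y_true

# Mean absolute error: average size of the miss.
mae = float(np.mean(np.abs(errors)))

# Root mean squared error: square, average, then take the square root.
rmse = float(np.sqrt(np.mean(errors ** 2)))

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
```

Here one prediction is off by 100 while the others are off by at most 10, which is why RMSE ends up well above MAE.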


### 2/3 tests fail.

### But can we narrow down the problem?

In [None]:
train_split_name = "San Francisco"
test_split_name = "Seattle"

# generate the explainer and compute performance
explainer = tru.get_explainer(test_split_name, comparison_data_splits=[train_split_name])

explainer.compute_model_score_instability(score_type=None, use_difference_of_means=False)

29.874412536621094
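
The `use_difference_of_means` flag suggests that instability can be summarized either as a gap between mean scores or as a distance between whole score distributions. The sketch below contrasts the two on made-up scores. Using the 1-D Wasserstein distance for the distributional option is our assumption for illustration, the helper names are hypothetical, and nothing here reproduces TruEra's internal computation.

```python
import numpy as np

def difference_of_means(scores_a, scores_b):
    """Absolute gap between the mean scores of two samples."""
    return float(abs(np.mean(scores_a) - np.mean(scores_b)))

def wasserstein_1d_equal_size(scores_a, scores_b):
    """1-D Wasserstein distance for two samples of equal size.

    With equal sample sizes it reduces to the mean absolute gap
    between the sorted values.
    """
    a = np.sort(np.asarray(scores_a, dtype=float))
    b = np.sort(np.asarray(scores_b, dtype=float))
    if len(a) != len(b):
        raise ValueError("this simplified form needs equal-sized samples")
    return float(np.mean(np.abs(a - b)))

# Made-up model scores: same mean, but the second sample is more spread out.
train_scores = [100.0, 200.0, 300.0]
test_scores = [50.0, 200.0, 350.0]

mean_shift = difference_of_means(train_scores, test_scores)
distribution_shift = wasserstein_1d_equal_size(train_scores, test_scores)

print(mean_shift, distribution_shift)
```

The two samples share a mean, so the difference of means sees no change, while the distributional distance still registers the wider spread.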

One common reason for overfitting is a distributional shift between train and test splits. Are there distributional shifts in the features?

Since our features are on different scales, we should choose a distance metric that is scale invariant.

In [None]:
explainer.compute_feature_contributors_to_instability().T.sort_values(by="San Francisco", ascending = False)

| Feature | San Francisco |
|---|---|
| latitude | 0.141920 |
| bedrooms | 0.116459 |
| availability_90 | 0.088282 |
| longitude | 0.086429 |
| room_type_Entire_home/apt | 0.066453 |
| ... | ... |
| amenities_Smartlock | 0.000000 |
| amenities_Crib | 0.000000 |
| amenities_Stair_gates | 0.000000 |
| amenities_Washer_/_Dryer | 0.000000 |


Latitude contributes the most to the instability between San Francisco and Seattle, and longitude is also among the top contributors.
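
To make the earlier point about scale invariance concrete, the sketch below compares two ways of measuring shift on synthetic data. The two-sample Kolmogorov-Smirnov statistic is our illustrative choice of a scale-invariant distance, not necessarily the metric TruEra uses, and the helper names are hypothetical. Rescaling a feature (as if changing its units) leaves the KS statistic unchanged but multiplies a raw gap in means.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def mean_gap(a, b):
    """Absolute difference of means, which depends on the feature's units."""
    return float(abs(np.mean(a) - np.mean(b)))

# Synthetic feature values for a train and a test split.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 500)
test = rng.normal(0.5, 1.0, 500)

# Same data expressed in units that are 1000 times smaller.
scale = 1000.0
ks_raw = ks_statistic(train, test)
ks_scaled = ks_statistic(train * scale, test * scale)
gap_raw = mean_gap(train, test)
gap_scaled = mean_gap(train * scale, test * scale)

print(f"KS statistic: raw={ks_raw:.3f}, rescaled={ks_scaled:.3f}")
print(f"Mean gap:     raw={gap_raw:.3f}, rescaled={gap_scaled:.3f}")
```

A scale-invariant measure lets you rank features such as `latitude` and `bedrooms` against each other even though their raw values live on very different scales.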

### Analyze root cause of problem and attempt to mitigate issue

Example possible causes:
1. Mislabeled points
2. Train/test not from same distribution
3. Data pipeline error
4. Too many unimportant features
5. Insufficient test data
6. Target leakage in the training process

Two of these causes apply here:

Cause 2 (train/test not from the same distribution): we corroborated this by (1) finding high-error segments, (2) identifying the features driving the error, and (3) comparing the distance between their distributions. Together these show that the distributional shift in latitude and longitude is a large source of error in Seattle.

Cause 4 (too many unimportant features): the feature importance test signaled a large number of unimportant features, which may be contributing to our overfitting.

Let's address these issues one at a time. First we can mitigate the error from latitude and longitude.

To do so, we will replace latitude and longitude with new features that measure each point's distance from its city center: the distance in latitude, the distance in longitude, and the combined Euclidean distance.

In [None]:
# load data
san_francisco = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-performance/San_Francisco.csv')
seattle = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-performance/Seattle.csv')

#make all float
san_francisco = san_francisco.astype(float)
seattle = seattle.astype(float)

#add point ids
sf_ids = [f"point_{i}" for i in range(len(san_francisco))]
san_francisco["id"] = sf_ids

se_ids = [f"point_{i}" for i in range(len(seattle))]
seattle["id"] = se_ids

import math

# create generalizable features from lat and lon: distance from the city center
def create_lat_lon_features(df, city_center_lat, city_center_lon):
  # work on a copy so the caller's dataframe is left unchanged
  df = df.copy()

  # absolute distance (in degrees) from the city center along each axis
  df["lat_dist"] = df["latitude"].apply(lambda x: abs(x - city_center_lat))
  df["lon_dist"] = df["longitude"].apply(lambda x: abs(x - city_center_lon))

  # Euclidean distance (in degrees) from the city center for each point
  df["lat_lon_dist"] = df.apply(lambda x: math.sqrt(x["lat_dist"]**2 + x["lon_dist"]**2), axis=1)

  # drop the raw coordinates, which do not transfer between cities
  df = df.drop(['latitude','longitude'], axis = 1)

  # return the modified dataframe
  return df

san_francisco_v2 = create_lat_lon_features(san_francisco, 37.7749, -122.4194)
seattle_v2 = create_lat_lon_features(seattle, 47.6062, -122.3321)

xgb_reg = xgb.XGBRegressor(eta = 0.2, max_depth = 4)
xgb_reg.fit(san_francisco_v2.drop(['price','id'], axis = 1), san_francisco_v2.price)
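
To see why distance-from-center features transfer between cities better than raw coordinates, consider the toy sketch below. It uses synthetic points scattered around the same two city centers, not the actual data, and `add_center_distances` and `fake_city` are hypothetical helpers written for this illustration. Raw latitudes from the two cities barely overlap, while the derived distances land on a shared scale.

```python
import numpy as np
import pandas as pd

def add_center_distances(df, center_lat, center_lon):
    """Replace raw coordinates with distances (in degrees) from a city center."""
    out = df.copy()
    out["lat_dist"] = (out["latitude"] - center_lat).abs()
    out["lon_dist"] = (out["longitude"] - center_lon).abs()
    out["lat_lon_dist"] = np.hypot(out["lat_dist"], out["lon_dist"])
    return out.drop(columns=["latitude", "longitude"])

rng = np.random.default_rng(0)

def fake_city(center_lat, center_lon, n=200):
    """Synthetic points scattered around a city center (illustration only)."""
    return pd.DataFrame({
        "latitude": center_lat + rng.normal(0, 0.03, n),
        "longitude": center_lon + rng.normal(0, 0.03, n),
    })

sf_raw = fake_city(37.7749, -122.4194)
se_raw = fake_city(47.6062, -122.3321)

sf_feat = add_center_distances(sf_raw, 37.7749, -122.4194)
se_feat = add_center_distances(se_raw, 47.6062, -122.3321)

# Gap between the two cities before and after the transformation.
raw_gap = abs(sf_raw["latitude"].mean() - se_raw["latitude"].mean())
feat_gap = abs(sf_feat["lat_dist"].mean() - se_feat["lat_dist"].mean())

print(f"Mean latitude gap between cities: {raw_gap:.3f} degrees")
print(f"Mean lat_dist gap between cities: {feat_gap:.4f} degrees")
```

A tree model trained on raw San Francisco latitudes has never seen values near Seattle's, so every Seattle point falls on one side of its latitude splits. The distance features avoid that because both cities produce values in the same range.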

In [None]:
# switch to local mode to add new data and model
tru.download_project(project_name)
tru.set_environment("local")
tru.set_project(project_name)

# since we changed our feature space, we need to add a new data collection
tru.add_data_collection("Data Collection v2")

# add data splits
tru.add_data_split("San Francisco", pre_data = san_francisco_v2.drop('price', axis = 1), label_data = san_francisco_v2[['id','price']], id_col_name = "id", split_type = "train")
tru.add_data_split("Seattle", pre_data = seattle_v2.drop('price', axis = 1), label_data = seattle_v2[['id','price']], id_col_name = "id", split_type = "test")

# add model
tru.add_python_model("model_2", xgb_reg, train_split_name="San Francisco", train_parameters = {"model_type":"xgb.XGBRegressor", "eta":0.2, "max_depth":4})

# sync with remote
tru.upload_project(upload_error_influences = False)
# Note that we're explicitly opting not to compute and upload error influences to optimize for ingestion speed. This is because SHAP is not optimized to compute error influences for regression models.
# Doing so will disable the "Debug" tab under Performance, and the "Contributors to Error Influence" tab under Drift.
# If you'd like to enable these pages, please remove the flag, i.e. run the method below. Please be aware that doing so will greatly increase the time required for ingestion.
# tru.upload_project()

INFO:truera.client.truera_workspace:Download temp_dir: /tmp/tmpwuw2iegv
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". 
INFO:truera.client.truera_workspace:Syncing data split "San Francisco" to local.
INFO:truera.client.local.local_truera_workspace:Data split "San Francisco" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing data split "Seattle" to local.
INFO:truera.client.local.local_truera_workspace:Data split "Seattle" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Skipping sync of model "model_1" as it is not a PyFunc model.
INFO:truera.client.truera_workspace:Syncing segments groups from remote to local.
INFO:truera.clien

Uploading tmpx31_1xnq.parquet -- ### -- file upload complete.
Put resource done.
Uploading tmpwkjqk454.parquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Call to join: rowsets ['3a69dc31-5bf6-49d8-a40b-519412e7f140', 'd654814a-22e1-43cc-8790-e6fee62c3520'] on ['id'] with default inner join.
INFO:truera.client.truera_workspace:Uploading data split Seattle.


Uploading tmpokanewv1.parquet -- ### -- file upload complete.
Put resource done.
Uploading tmpx6qio1kn.parquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Call to join: rowsets ['646cafb0-fbf0-4c17-b026-1d0bc5f008e2', 'e733eb99-3d50-4817-a3b2-c26ce8dc8670'] on ['id'] with default inner join.
INFO:truera.client.local.local_truera_workspace:Setting local model context to "model_2".
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_2".


Uploading tmp0kyx0aq9parquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmp9e2rmnr3parquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Influence algorithm for local project is: shap


Uploading tmp0w7fgbx6parquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.truera_workspace:Influence algorithm for local project is: shap


Uploading tmp9q8du5paparquet -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Model "model_1" in remote is associated with a different data_collection ("Data Collection v1") than the one in context ("Data Collection v2").
INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v1". The previous data collection ("Data Collection v2") and its associated data splits and/or models have been cleared from the remote environment workspace context.
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_1".


In [None]:
#toggle back to remote to interact with the tester
tru.set_environment("remote")
tru.set_project(project_name)
tru.set_data_collection("Data Collection v2")
tru.set_model("model_2")
tru.tester.get_model_test_results(test_types=["performance"])

INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v2". The previous data collection ("Data Collection v1") and its associated data splits and/or models have been cleared from the remote environment workspace context.
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_2".


| | Name | Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|
| | Relative MAE Test | Seattle | ALL POINTS | MAE | 77.062 | Explore in UI |
| ✅ | RMSE Test | San Francisco | ALL POINTS | RMSE | 75.2512 | Explore in UI |
| ✅ | RMSE Test | Seattle | ALL POINTS | RMSE | 106.8475 | Explore in UI |
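
To quantify the change, the short sketch below computes the relative error reduction from `model_1` to `model_2`, using the scores reported in the two test result tables in this notebook. The `relative_reduction` helper is hypothetical and is not part of the TruEra client.

```python
# (before, after) scores copied from the model_1 and model_2 test results.
results = {
    ("Seattle", "MAE"): (110.2897, 77.062),
    ("Seattle", "RMSE"): (154.0362, 106.8475),
    ("San Francisco", "RMSE"): (75.6326, 75.2512),
}

def relative_reduction(before, after):
    """Fraction by which the error dropped relative to the original score."""
    return (before - after) / before

improvements = {
    key: relative_reduction(before, after)
    for key, (before, after) in results.items()
}

for (split, metric), fraction in improvements.items():
    print(f"{split} {metric}: {fraction:.1%} lower error")
```

On these numbers the Seattle error drops by roughly 30% on both metrics, while the San Francisco RMSE is essentially unchanged.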


### 🪄 Huzzah! Using the Test Harness as our guide, we quickly diagnosed the true cause of overfitting and improved performance.