# Guided Exercise: Performance Part 2

### Goals 🎯
In this tutorial, you will use TruEra to make performance improvements to the model created in part 1 in a structured and methodical way!

If you missed part 1 and need to go back: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/truera-examples/blob/release/rc-1.37/starter-examples/starter-performance-part-1.ipynb)

In this tutorial, you will:
1. View the results of the performance and feature importance tests created in part 1.
2. Find actionable issues with the model created in part 1.
3. Mitigate these issues and re-upload your model to TruEra.
4. Retest the new model and confirm the effectiveness of the mitigation strategy.

### First, set the credentials for your TruEra deployment.

If you don't have credentials yet, get them instantly by signing up for free at: https://www.truera.com

In [1]:
# connection details
TRUERA_URL = "https://app.truera.net"
AUTH_TOKEN = ""

### Install the required packages for running in Colab

In [2]:
! pip install truera

### From here, run the rest of the notebook and follow the analysis.

In [4]:
import logging

import pandas as pd
import xgboost as xgb
from truera.client.ingestion import ColumnSpec, ModelOutputContext
from truera.client.truera_authentication import TokenAuthentication
from truera.client.truera_workspace import TrueraWorkspace

# connect to the TruEra deployment using the credentials set above
auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(TRUERA_URL, auth)

### First, let's review the test results from part 1.

In [5]:
# set project and data collection
project_name = "Starter Example Companion - Performance"
tru.set_project(project_name)
tru.set_data_collection("Data Collection v1")

# get model results
tru.set_model("model_1")
tru.tester.get_model_test_results(test_types=["performance"])

INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v1". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_1".


| | Name | Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|
| ❌ | Relative MAE Test | Seattle | ALL POINTS | MAE | 110.2694 | Explore in UI |
| ❌ | RMSE Test | Seattle | ALL POINTS | RMSE | 154.0107 | Explore in UI |
| ✅ | RMSE Test | San Francisco | ALL POINTS | RMSE | 75.6847 | Explore in UI |
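
For readers who want a reminder of what the two metrics in this table measure, here is a minimal sketch of MAE and RMSE in plain Python. The helper names and the toy prices are hypothetical and are not taken from the Airbnb data or from the TruEra client.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average magnitude of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large residuals more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy prices (hypothetical values, for illustration only).
actual = [100.0, 150.0, 200.0, 400.0]
predicted = [110.0, 140.0, 190.0, 300.0]

print(f"MAE:  {mae(actual, predicted):.2f}")
print(f"RMSE: {rmse(actual, predicted):.2f}")
```

Because RMSE squares each residual before averaging, a single badly mispriced listing raises it much more than it raises MAE.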


### Two of the three tests fail, both on the Seattle split.

### But can we narrow down the problem?

In [6]:
train_split_name = "San Francisco"
test_split_name = "Seattle"

# generate the explainer and compute performance
explainer = tru.get_explainer(test_split_name, comparison_data_splits=[train_split_name])

explainer.compute_model_score_instability(score_type=None, use_difference_of_means=False)

INFO:truera.client.truera_workspace:Download temp_dir: /var/folders/pg/2f8qcnr92cx4rcwpvm_x2ckc0000gn/T/tmpglyvi4m9
INFO:truera.client.truera_workspace:Syncing data collection "Data Collection v1" to local.
INFO:truera.client.local.local_truera_workspace:Data collection in local environment is now set to "Data Collection v1". 
INFO:truera.client.truera_workspace:Syncing data split "San Francisco" to local.
INFO:truera.client.local.local_truera_workspace:Data split "San Francisco" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing data split "Seattle" to local.
INFO:truera.client.local.local_truera_workspace:Data split "Seattle" is added to local data collection "Data Collection v1", and set as the data split for the workspace context.
INFO:truera.client.truera_workspace:Syncing model model_1 to local.
INFO:truera.client.local.local_truera_workspace:Model "model_1" is added and associat

30.131076228459097
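
The number above is the instability of the model's scores between the Seattle and San Francisco splits. As a rough illustration of the idea, the sketch below compares two sets of predictions in two ways: a difference of means and a one-dimensional Wasserstein distance. This is not TruEra's implementation, and which distance the client uses when `use_difference_of_means=False` is an assumption left open here. The function names and toy predictions are hypothetical.

```python
from statistics import mean


def difference_of_means(scores_a, scores_b):
    """Absolute gap between the average score of each split."""
    return abs(mean(scores_a) - mean(scores_b))


def wasserstein_1d(scores_a, scores_b):
    """W1 distance for equal-sized samples: mean gap between sorted values."""
    if len(scores_a) != len(scores_b):
        raise ValueError("this simplified version needs equal-sized samples")
    gaps = [abs(a - b) for a, b in zip(sorted(scores_a), sorted(scores_b))]
    return mean(gaps)


# Toy predicted prices for two splits (hypothetical values).
train_scores = [100, 120, 140, 160]
test_scores = [90, 130, 150, 230]

print("difference of means:", difference_of_means(train_scores, test_scores))
print("wasserstein distance:", wasserstein_1d(train_scores, test_scores))
```

The Wasserstein distance is never smaller than the difference of means, because it also reacts to changes in the shape of the score distribution and not only to a shift in its average.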

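The model's scores are unstable between the two splits. One common cause of this kind of train/test gap, often attributed to overfitting, is a distributional shift between the train and test splits. Are there distributional shifts in the features?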

Since our features are on different scales, we should choose a distance metric that is scale invariant.

In [7]:
explainer.compute_feature_contributors_to_instability().T.sort_values(by="San Francisco", ascending=False)



| Feature | San Francisco |
|---|---|
| latitude | 0.136916 |
| bedrooms | 0.096992 |
| longitude | 0.090943 |
| availability_90 | 0.080386 |
| reviews_per_month | 0.056922 |
| ... | ... |
| features_nan | 0.000000 |
| amenities_Children’s_dinnerware | 0.000000 |
| amenities_Stair_gates | 0.000000 |
| amenities_Washer_/_Dryer | 0.000000 |


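Latitude is the largest contributor to instability between San Francisco and Seattle, and longitude ranks third, just behind bedrooms. Unlike bedrooms, raw coordinates are tied to a specific city, so a model trained on San Francisco coordinates has never seen values like Seattle's.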
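
To make the earlier point about scale invariance concrete, here is a small sketch using the two-sample Kolmogorov-Smirnov statistic, which depends only on the ordering of values and so does not change when a feature is rescaled. This is an illustrative choice of metric, not necessarily the one the explainer uses, and the helper name and toy samples are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    # Evaluate both empirical CDFs at every distinct value in the pooled sample.
    for value in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= value:
            i += 1
        while j < len(b) and b[j] <= value:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


# Toy feature values for two splits (hypothetical).
split_a = [1, 2, 3, 4, 5]
split_b = [3, 4, 5, 6, 7]

# The same data expressed in a unit that is 1000 times smaller.
scaled_a = [x * 1000 for x in split_a]
scaled_b = [x * 1000 for x in split_b]

print("KS on original scale:", ks_statistic(split_a, split_b))
print("KS after rescaling:  ", ks_statistic(scaled_a, scaled_b))
```

A raw difference of means would grow by the same factor of 1000 under this rescaling, which is why it cannot be compared fairly across features measured in different units.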

### Analyze root cause of problem and attempt to mitigate issue

Example possible causes:
1. Mislabeled points
2. Train/test not from same distribution
3. Data pipeline error
4. Too many unimportant features
5. Insufficient test data
6. Target leakage in the training process

We can identify two of these causes in our model:

Cause 2 (train/test not from the same distribution): by (1) finding high-error segments, (2) finding the features driving the error, and (3) comparing the distance between their distributions, we corroborated that the distributional shift of latitude and longitude is a large source of error in Seattle.

Cause 4 (too many unimportant features): the feature importance test signals a large number of unimportant features, which may be contributing to our overfitting.

Let's address these issues one at a time. First we can mitigate the error from latitude and longitude.

To do so, we will replace the raw coordinates with three features measured relative to each city's center: the absolute latitude offset, the absolute longitude offset, and the Euclidean distance that combines the two.

In [8]:
# load data
san_francisco = pd.read_csv(
    'https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-performance/San_Francisco.csv')
seattle = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-performance/Seattle.csv')

#make all float
san_francisco = san_francisco.astype(float).reset_index(names="id")
seattle = seattle.astype(float).reset_index(names="id")

import numpy as np


# create a generalizable feature from lat and lon: distance from city center
def create_lat_lon_features(df, city_center_lat, city_center_lon):
    # work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # absolute offset (in degrees) from the city center along each axis
    df["lat_dist"] = (df["latitude"] - city_center_lat).abs()
    df["lon_dist"] = (df["longitude"] - city_center_lon).abs()

    # Euclidean distance (in degrees) from the city center
    df["lat_lon_dist"] = np.sqrt(df["lat_dist"]**2 + df["lon_dist"]**2)

    # drop the raw coordinates, which do not transfer between cities
    df = df.drop(['latitude', 'longitude'], axis=1)

    # return the modified copy
    return df


san_francisco_v2 = create_lat_lon_features(san_francisco, 37.7749, -122.4194)
seattle_v2 = create_lat_lon_features(seattle, 47.6062, -122.3321)

xgb_reg = xgb.XGBRegressor(eta=0.2, max_depth=4)
xgb_reg.fit(san_francisco_v2.drop(['price', 'id'], axis=1), san_francisco_v2.price)
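
The distance features above are measured in degrees, and one degree of longitude covers less ground in Seattle than in San Francisco. As a possible refinement, not used in this notebook, the sketch below computes the great-circle (haversine) distance in kilometers instead. The helper name, the mean Earth radius constant, and the example listing offsets are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# City centers used in the notebook cell above.
SF_CENTER = (37.7749, -122.4194)
SEATTLE_CENTER = (47.6062, -122.3321)

# A hypothetical listing 0.01 degrees north and east of each city center.
sf_km = haversine_km(SF_CENTER[0] + 0.01, SF_CENTER[1] + 0.01, *SF_CENTER)
seattle_km = haversine_km(SEATTLE_CENTER[0] + 0.01, SEATTLE_CENTER[1] + 0.01, *SEATTLE_CENTER)

print(f"San Francisco listing: {sf_km:.3f} km from center")
print(f"Seattle listing:       {seattle_km:.3f} km from center")
```

The same offset in degrees yields a shorter distance in Seattle, so a kilometer-based feature may transfer between cities more faithfully than a degree-based one. Whether it improves the test results here would need to be checked by retraining.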

In [9]:
# since we changed our feature space, we need to add a new data collection
tru.add_data_collection("Data Collection v2")

# add data to the collection we just created
tru.add_data(data=san_francisco_v2,
             data_split_name="San Francisco",
             column_spec=ColumnSpec(id_col_name="id",
                                    pre_data_col_names=list(san_francisco_v2.columns.drop(["id", "price"])),
                                    label_col_names="price"))
tru.add_data(data=seattle_v2,
             data_split_name="Seattle",
             column_spec=ColumnSpec(id_col_name="id",
                                    pre_data_col_names=list(seattle_v2.columns.drop(["id", "price"])),
                                    label_col_names="price"))
# add model
tru.add_python_model("model_2",
                     xgb_reg,
                     train_split_name="San Francisco",
                     train_parameters={
                         "model_type": "xgb.XGBRegressor",
                         "eta": 0.2,
                         "max_depth": 4
                     })

Uploading tmpr0vdmb5k.parquet (392.6KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...


Uploading tmpqown1iw_.parquet (264.9KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Waiting for data split to materialize...
INFO:truera.client.remote_truera_workspace:Uploading xgboost model: XGBRegressor
INFO:truera.client.remote_truera_workspace:Verifying model...
INFO:truera.client.remote_truera_workspace:✔️ Verified packaged model format.
INFO:truera.client.remote_truera_workspace:✔️ Loaded model in current environment.
INFO:truera.client.remote_truera_workspace:✔️ Called predict on model.
INFO:truera.client.remote_truera_workspace:✔️ Verified model output.
INFO:truera.client.remote_truera_workspace:Verification succeeded!


Uploading MLmodel (218.0B) -- ### -- file upload complete.
Uploading tmp0k27hen5.json (174.0KiB) -- ### -- file upload complete.
Uploading conda.yaml (208.0B) -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.py (459.0B) -- ### -- file upload complete.
Uploading xgboost_regression_predict_wrapper.cpython-310.pyc (1.1KiB) -- ### -- file upload complete.
Put resource done.


INFO:truera.client.remote_truera_workspace:Model "model_2" is added and associated with remote data collection "Data Collection v2". "model_2" is set as the model for the workspace context.


Model uploaded to: http://localhost:8000/home/p/Starter%20Example%20Companion%20-%20Performance/m/model_2/


INFO:truera.client.remote_truera_workspace:Triggering computations for model predictions on split San Francisco.
INFO:truera.client.remote_truera_workspace:Data collection in remote environment is now set to "Data Collection v2". 
INFO:truera.client.remote_truera_workspace:Setting remote model context to "model_2".


In [10]:
tru.tester.get_model_test_results(test_types=["performance"])


| | Name | Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|
| | Relative MAE Test | Seattle | ALL POINTS | MAE | | Explore in UI |
| | RMSE Test | San Francisco | ALL POINTS | RMSE | | Explore in UI |
| | RMSE Test | Seattle | ALL POINTS | RMSE | | Explore in UI |
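
The other cause identified earlier, too many unimportant features, is not addressed in this notebook. One possible follow-up, sketched here under assumptions, is to drop features whose importance is at or below a threshold and then retrain. The `select_features` helper, the threshold, and the importance values are hypothetical; with the model above, real values could be read from `xgb_reg.feature_importances_`.

```python
def select_features(importances, threshold=0.0):
    """Return the names of features whose importance is strictly above threshold."""
    return [name for name, score in importances.items() if score > threshold]


# Hypothetical importances for illustration only (not results from this notebook).
importances = {
    "lat_lon_dist": 0.21,
    "bedrooms": 0.18,
    "availability_90": 0.07,
    "amenities_Stair_gates": 0.0,
    "features_nan": 0.0,
}

kept = select_features(importances)
dropped = sorted(set(importances) - set(kept))

print("kept:", kept)
print("dropped:", dropped)
```

Because dropping columns changes the feature space again, the pruned data would need its own data collection and model upload, just as was done for "Data Collection v2".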


### 🪄 Huzzah! Using the Test Harness as our guide, we quickly diagnosed the true cause of overfitting and improved performance.