# Guided Exercise: Performance Part 2
#### Goals 🎯
In this tutorial, you will use TruEra to make performance improvements to the model created in part 1 in a structured and methodical way!

If you missed part 1 and need to go back: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1gn8HfAD9G6L6XGhegAHjuBDucZbZH74W)

In this tutorial, you will:
1. View the results of the performance and feature importance tests created in part 1.
2. Find actionable issues with the model created in part 1.
3. Mitigate these issues and re-upload your model to TruEra.
4. Retest the new model and confirm the effectiveness of the mitigation strategy.

### First, set the credentials for your TruEra deployment.

In [None]:
#connection details
CONNECTION_STRING = ""
AUTH_TOKEN = ""

### Install the required packages

In [None]:
! pip install --upgrade shap
! pip install --upgrade truera

### From here, run the rest of the notebook and follow the analysis.

In [None]:
import pandas as pd
import xgboost as xgb
import logging

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication

auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(CONNECTION_STRING, auth, ignore_version_mismatch=True, log_level=logging.ERROR)

# set our environment to remote to view the test results from part one
tru.set_environment("remote")
# note: we'll periodically toggle between local and remote so we can interact with our remote deployment as well.

### First, let's review the test results from part 1.

In [None]:

# set project and data collection
tru.set_project("Starter Example - Performance")
tru.set_data_collection("Data Collection v1")

# get model results
tru.set_model("model_1")
tru.tester.get_model_test_results(test_types=["performance"])

| | Name | Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|
| ❌ | Relative MAE Test | Seattle | ALL POINTS | MAE | 123.4791 | Explore in UI |
| ❌ | RMSE Test | Seattle | ALL POINTS | RMSE | 161.2632 | Explore in UI |
| ✅ | RMSE Test | San Francisco | ALL POINTS | RMSE | 82.4993 | Explore in UI |


### Both tests fail on the Seattle split.

### But can we narrow down the problem?

First, let's directly search for high error segments. 

We can do this in two ways: (1) find high error segments in the test split alone, or (2) find segments in which the test split has higher error than the train split.

Let's do both!

In [None]:
train_split_name = "San Francisco"
test_split_name = "Seattle"

# generate the explainer and compute performance
explainer = tru.get_explainer(test_split_name, comparison_data_splits=[train_split_name])
explainer.compute_performance(metric_type="MAE")

# method 1, high error just in test
explainer.set_base_data_split(test_split_name)
explainer.suggest_high_error_segments(metric_types="MAE", max_num_responses = 5)

| | representation | MAE | size | size (%) |
|---|---|---|---|---|
| 0 | amenities_Cable_TV <= 1.0 AND amenities_Cable_... | 160.799057 | 1443 | 37.864078 |
| 1 | room_type_Entire_home/apt <= 1.0 AND room_type... | 154.746643 | 2538 | 66.596694 |
| 2 | minimum_nights <= 3.0 AND minimum_nights >= 3.0 | 165.156097 | 480 | 12.595119 |
| 3 | room_type_Private_room <= 0.0 AND room_type_Pr... | 150.535126 | 2655 | 69.666754 |
| 4 | longitude <= -122.4000448211261 AND longitude ... | 188.002533 | 77 | 2.020467 |


In [None]:
# method 2 - high error in test relative to train
explainer.suggest_high_error_segments(metric_types="MAE", max_num_responses = 5, comparison_data_split_name = train_split_name)

| | representation | MAE | size | size (%) |
|---|---|---|---|---|
| 0 | longitude <= -122.4000448211261 AND longitude ... | 188.002533 | 77 | 2.020467 |
| 1 | amenities_Cable_TV <= 1.0 AND amenities_Cable_... | 160.799057 | 1443 | 37.864078 |
| 2 | minimum_nights <= 3.0 AND minimum_nights >= 3.0 | 165.156097 | 480 | 12.595119 |
| 3 | room_type_Entire_home/apt <= 1.0 AND room_type... | 154.746643 | 2538 | 66.596694 |
| 4 | room_type_Private_room <= 0.0 AND room_type_Pr... | 150.535126 | 2655 | 69.666754 |


You can see that both ways return the same five high error segments, just in a different order.
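
To make the difference between the two rankings concrete, here is a minimal sketch in plain Python. It does not use the TruEra API: the rows, the segment names, and the helper `segment_mae` are all made up for illustration, and reading method 2 as "test MAE minus train MAE per segment" is an assumption about what the comparison does.

```python
from collections import defaultdict

def segment_mae(rows):
    """Mean absolute error per segment for (segment, actual, predicted) rows."""
    errors = defaultdict(list)
    for segment, actual, predicted in rows:
        errors[segment].append(abs(actual - predicted))
    return {seg: sum(errs) / len(errs) for seg, errs in errors.items()}

# Hypothetical (segment, actual price, predicted price) rows.
train_rows = [
    ("entire_home", 200, 190), ("entire_home", 300, 320),
    ("private_room", 100, 60), ("private_room", 120, 170),
]
test_rows = [
    ("entire_home", 250, 170), ("entire_home", 400, 340),
    ("private_room", 90, 10), ("private_room", 110, 180),
]

train_mae = segment_mae(train_rows)
test_mae = segment_mae(test_rows)

# Method 1: rank segments by their error on the test split alone.
by_test_error = sorted(test_mae, key=test_mae.get, reverse=True)

# Method 2 (assumed reading): rank segments by how much worse test is than train.
gap = {seg: test_mae[seg] - train_mae[seg] for seg in test_mae}
by_gap = sorted(gap, key=gap.get, reverse=True)

print(by_test_error)
print(by_gap)
```

In this toy data the same two segments appear in both rankings but in a different order: a segment that is hard on both splits ranks high under method 1, while a segment that only degrades on the test split ranks high under method 2.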

One common reason for overfitting is a distributional shift between train and test splits. Are there distributional shifts in the features?

Since our features are on different scales, we should choose a distance metric that is scale invariant.

In [None]:
# look for feature drift using data_profiler
tru.set_data_split(train_split_name)
tru.data_profiler.compute_feature_drift(comparison_data_splits=[test_split_name],
    distance_metrics=["NUMERICAL_JENSEN_SHANNON_DISTANCE"])['Seattle'].\
        sort_values(by='NUMERICAL_JENSEN_SHANNON_DISTANCE', ascending=False)

| | NUMERICAL_JENSEN_SHANNON_DISTANCE |
|---|---|
| latitude | 1.000000 |
| longitude | 0.933833 |
| amenities_Hangers | 0.351672 |
| amenities_Iron | 0.331656 |
| availability_365 | 0.319597 |
| ... | ... |
| property_type_Yurt | 0.002731 |
| bed_type_Airbed | 0.002666 |
| amenities_Washer_/_Dryer | 0.002586 |
| amenities_Essentials | 0.002526 |


Latitude and longitude have by far the largest distributional shift between San Francisco and Seattle.
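
A distance of 1.0 for latitude is what you would expect when the two splits share no values at all. The sketch below illustrates this with a hand-rolled Jensen-Shannon distance. It is not TruEra's implementation: the equal-width binning over a shared range, the log base 2 (which bounds the distance to [0, 1]), the helper names, and the sample values are all assumptions made for illustration.

```python
import math

def histogram(values, lo, hi, bins):
    """Share of values falling in each of `bins` equal-width bins on [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

def js_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two histograms."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))

# Made-up latitudes: the two cities share no latitude range at all.
sf_lat = [37.72, 37.75, 37.78, 37.80]
seattle_lat = [47.55, 47.60, 47.62, 47.68]
d_lat = js_distance(histogram(sf_lat, 37.0, 48.0, 10),
                    histogram(seattle_lat, 37.0, 48.0, 10))

# Made-up feature with partial overlap, measured in two different units.
a, b = [1, 2, 3, 4], [3, 4, 5, 6]
d_small = js_distance(histogram(a, 0, 8, 4), histogram(b, 0, 8, 4))
d_large = js_distance(histogram([x * 1000 for x in a], 0, 8000, 4),
                      histogram([x * 1000 for x in b], 0, 8000, 4))

print(d_lat, d_small, d_large)
```

Because the distance compares the shapes of the binned distributions rather than raw values, rescaling a feature (here by a factor of 1000) leaves it unchanged, which is why it is a reasonable choice when features are on different scales.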

### Analyze root cause of problem and attempt to mitigate issue

Example possible causes:
1. Mislabeled points
2. Train/test not from same distribution
3. Data pipeline error
4. Too many unimportant features
5. Insufficient test data
6. Target leakage in the training process

We can identify two issues:

Cause 2 (train/test not from the same distribution): three findings corroborate each other: (1) the high error segments, (2) the features driving the error, and (3) the distance between the feature distributions. Together they indicate that the distributional shift of latitude and longitude is a large source of error in Seattle.

Cause 4 (too many unimportant features): the feature importance test signals a large number of unimportant features, which may be causing our overfitting.

Let's address these issues one at a time. First we can mitigate the error from latitude and longitude.

To do so, we will replace latitude and longitude in each city with distances from the city center: the distance in latitude, the distance in longitude, and the Euclidean distance that combines the two.

In [None]:
# load data
san_francisco = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-performance/San_Francisco.csv')
seattle = pd.read_csv('https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-performance/Seattle.csv')

def create_lat_lon_features(df, city_center_lat, city_center_lon):
  # work on a copy so the caller's dataframe is left unchanged
  df = df.copy()

  # absolute distance (in degrees) from the city center along each axis
  df["lat_dist"] = (df["latitude"] - city_center_lat).abs()
  df["lon_dist"] = (df["longitude"] - city_center_lon).abs()

  # Euclidean distance from the city center, combining both axes
  df["lat_lon_dist"] = (df["lat_dist"]**2 + df["lon_dist"]**2) ** 0.5

  # drop the raw coordinates, which differ completely between cities
  df = df.drop(['latitude','longitude'], axis = 1)

  # return the modified dataframe
  return df

san_francisco_v2 = create_lat_lon_features(san_francisco, 37.7749, -122.4194)
seattle_v2 = create_lat_lon_features(seattle, 47.6062, -122.3321)

xgb_reg = xgb.XGBRegressor(eta = 0.2, max_depth = 4)
xgb_reg.fit(san_francisco_v2.drop('price', axis = 1), san_francisco_v2.price)



XGBRegressor(eta=0.2, max_depth=4)
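
To see why distance-from-center features should reduce the shift, the following sketch compares value ranges before and after the transformation. It uses the same city-center coordinates as the cell above, but the listings and the helper names (`center_distance_features`, `ranges_overlap`) are hypothetical and chosen only for illustration.

```python
import math

def center_distance_features(lat, lon, center_lat, center_lon):
    """Return (lat_dist, lon_dist, lat_lon_dist) in degrees from the city center."""
    lat_dist = abs(lat - center_lat)
    lon_dist = abs(lon - center_lon)
    return lat_dist, lon_dist, math.hypot(lat_dist, lon_dist)

def ranges_overlap(a, b):
    """True if the [min, max] intervals of the two value lists intersect."""
    return min(a) <= max(b) and min(b) <= max(a)

SF_CENTER = (37.7749, -122.4194)
SEATTLE_CENTER = (47.6062, -122.3321)

# Hypothetical (latitude, longitude) listings in each city.
sf_listings = [(37.76, -122.43), (37.80, -122.41), (37.73, -122.45)]
seattle_listings = [(47.61, -122.34), (47.65, -122.31), (47.57, -122.36)]

sf_dist = [center_distance_features(lat, lon, *SF_CENTER)[2]
           for lat, lon in sf_listings]
seattle_dist = [center_distance_features(lat, lon, *SEATTLE_CENTER)[2]
                for lat, lon in seattle_listings]

raw_overlap = ranges_overlap([lat for lat, _ in sf_listings],
                             [lat for lat, _ in seattle_listings])
dist_overlap = ranges_overlap(sf_dist, seattle_dist)

print(raw_overlap, dist_overlap)  # False True
```

Raw latitudes from the two cities never overlap, so anything the model learns from them in San Francisco cannot carry over to Seattle, whereas the distance features occupy a comparable range in both cities. Note that these distances are in degrees rather than kilometers, so they are only a rough proxy for physical distance.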

In [None]:
# switch to local mode to add new data and model
tru.download_project("Starter Example - Performance")
tru.set_environment("local")
tru.set_project("Starter Example - Performance")

# since we changed our feature space, we need to add a new data collection
tru.add_data_collection("Data Collection v2")

# add data splits
tru.add_data_split("San Francisco", pre_data = san_francisco_v2.drop('price', axis = 1), label_data = san_francisco_v2['price'], split_type = "train")
tru.add_data_split("Seattle", pre_data = seattle_v2.drop('price', axis = 1), label_data = seattle_v2['price'], split_type = "test")

# add model
tru.add_python_model("model_2", xgb_reg, train_split_name="San Francisco", train_parameters = {"model_type":"xgb.XGBRegressor", "eta":0.2, "max_depth":4})

# sync with remote
tru.upload_project(upload_error_qiis=True)



In [None]:
#toggle back to remote to interact with the tester
tru.set_environment("remote")
tru.set_project("Starter Example - Performance")
tru.set_data_collection("Data Collection v2")
tru.set_model("model_2")

# test to confirm we fixed the feature importance issue
tru.tester.add_feature_importance_test(
    test_name = 'Feature Importance Test 2',
    data_split_name_regex = 'San Francisco',
    background_split_name = 'San Francisco',
    min_importance_value=0.01,
    fail_if_greater_than = 15)

tru.tester.get_model_test_results(test_types=["performance","feature_importance"])

| | Name | Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|
| | Relative MAE Test | Seattle | ALL POINTS | MAE | 80.8256 | Explore in UI |
| ✅ | RMSE Test | San Francisco | ALL POINTS | RMSE | 83.8805 | Explore in UI |
| ✅ | RMSE Test | Seattle | ALL POINTS | RMSE | 108.7328 | Explore in UI |

| | Name | Split | Segment | Background Split Name | Min. Importance Value | Score Type | Score | Navigate |
|---|---|---|---|---|---|---|---|---|
| ❌ | Feature Importance Test 2 | San Francisco | ALL POINTS | San Francisco | 0.01 | regression | 102 | Explore in UI |


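Here we can see an improvement from this change: the Seattle MAE drops from 123.4791 to 80.8256, and the Seattle RMSE test now passes. We have not yet solved every issue, though: the feature importance test fails with a score of 102, well above the threshold of 15.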

Next we'll prune the model.
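
As a rough illustration of the idea, the sketch below counts low-importance features and then keeps only those at or above a cutoff. It assumes the feature importance test counts features whose importance falls below `min_importance_value` and fails when that count exceeds `fail_if_greater_than`; that reading, the helper names, and the importance values are hypothetical and not taken from the TruEra API or from this project.

```python
def count_unimportant(importances, min_importance):
    """Number of features whose importance is below the minimum."""
    return sum(1 for value in importances.values() if value < min_importance)

def prune(importances, cutoff):
    """Names of features whose importance is at least the cutoff."""
    return [name for name, value in importances.items() if value >= cutoff]

# Hypothetical global feature importances (feature name -> importance).
importances = {
    "lat_lon_dist": 0.21,
    "bedrooms": 0.18,
    "accommodates": 0.12,
    "amenities_Iron": 0.004,
    "bed_type_Airbed": 0.001,
    "property_type_Yurt": 0.0,
}

n_low = count_unimportant(importances, min_importance=0.01)
test_fails = n_low > 2  # stand-in for fail_if_greater_than
kept = prune(importances, cutoff=0.005)

print(n_low, test_fails, kept)
```

A lower cutoff keeps more features and a higher one prunes more aggressively, so the cutoff is worth treating as a tunable choice and checking against the test results after retraining.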

In [None]:
# prune features
explainer = tru.get_explainer('Seattle', comparison_data_splits=['San Francisco'])
global_feature_importances = explainer.get_global_feature_importances()

def prune_features(global_feature_importances, cutoff):
    feature_importance = global_feature_importances.T.rename(columns = {0:'importance'})
    pruned_feature_importance = feature_importance[feature_importance['importance'] >= cutoff]
    return list(pruned_feature_importance.index)

pruned_feature_list = prune_features(global_feature_importances, 0.005)

pruned_feature_list += ['price'] # don't leave off the target

In [None]:
# apply transformations
san_francisco_v3 = san_francisco_v2[pruned_feature_list]
seattle_v3 = seattle_v2[pruned_feature_list]

# train a new model
xgb_reg = xgb.XGBRegressor(eta = 0.2, max_depth = 4)
xgb_reg.fit(san_francisco_v3.drop('price', axis = 1), san_francisco_v3.price)



XGBRegressor(eta=0.2, max_depth=4)

In [None]:
# switch to local mode to add new data and model
tru.set_environment("local")
tru.add_data_collection("Data Collection v3")

tru.add_data_split("San Francisco", pre_data = san_francisco_v3.drop('price', axis = 1), label_data = san_francisco_v3['price'], split_type = "train")
tru.add_data_split("Seattle", pre_data = seattle_v3.drop('price', axis = 1), label_data = seattle_v3['price'], split_type = "test")

tru.add_python_model("model_3", xgb_reg, train_split_name="San Francisco", train_parameters = {"model_type":"xgb.XGBRegressor", "eta":0.2, "max_depth":4})

tru.upload_project(upload_error_qiis=True)



In [None]:
#toggle back to remote to interact with the tester
tru.set_environment("remote")
tru.set_project("Starter Example - Performance")
tru.set_data_collection("Data Collection v3")
tru.set_model("model_3")

tru.tester.get_model_test_results(test_types=["performance"])

| | Name | Split | Segment | Metric | Score | Navigate |
|---|---|---|---|---|---|---|
| ❌ | RMSE Test | Seattle | ALL POINTS | RMSE | 130.489 | Explore in UI |
| | Relative MAE Test | Seattle | ALL POINTS | MAE | 104.1811 | Explore in UI |
| ✅ | RMSE Test | San Francisco | ALL POINTS | RMSE | 109.9693 | Explore in UI |


### 🪄 Huzzah! Using the Test Harness as our guide, we quickly diagnosed the true cause of overfitting and improved performance.